{"title": "The Multi-fidelity Multi-armed Bandit", "book": "Advances in Neural Information Processing Systems", "page_first": 1777, "page_last": 1785, "abstract": "We study a variant of the classical stochastic $K$-armed bandit where observing the outcome of each arm is expensive, but cheap approximations to this outcome are available. For example, in online advertising the performance of an ad can be approximated by displaying it for shorter time periods or to narrower audiences. We formalise this task as a \\emph{multi-fidelity} bandit, where, at each time step, the forecaster may choose to play an arm at any one of $M$ fidelities. The highest fidelity (desired outcome) expends cost $\\costM$. The $m$\\ssth fidelity (an approximation) expends $\\costm < \\costM$ and returns a biased estimate of the highest fidelity. We develop \\mfucb, a novel upper confidence bound procedure for this setting and prove that it naturally adapts to the sequence of available approximations and costs thus attaining better regret than naive strategies which ignore the approximations. For instance, in the above online advertising example, \\mfucbs would use the lower fidelities to quickly eliminate suboptimal ads and reserve the larger expensive experiments on a small set of promising candidates. We complement this result with a lower bound and show that \\mfucbs is nearly optimal under certain conditions.", "full_text": "The Multi-\ufb01delity Multi-armed Bandit\n\nKirthevasan Kandasamy (cid:92), Gautam Dasarathy \u2666, Jeff Schneider (cid:92), Barnab\u00e1s P\u00f3czos (cid:92)\n\n(cid:92) Carnegie Mellon University, \u2666 Rice University\n\n{kandasamy, schneide, bapoczos}@cs.cmu.edu, gautamd@rice.edu\n\nAbstract\n\nWe study a variant of the classical stochastic K-armed bandit where observing\nthe outcome of each arm is expensive, but cheap approximations to this outcome\nare available. 
For example, in online advertising the performance of an ad can be approximated by displaying it for shorter time periods or to narrower audiences. We formalise this task as a multi-fidelity bandit, where, at each time step, the forecaster may choose to play an arm at any one of M fidelities. The highest fidelity (desired outcome) expends cost λ^{(M)}. The mth fidelity (an approximation) expends λ^{(m)} < λ^{(M)} and returns a biased estimate of the highest fidelity. We develop MF-UCB, a novel upper confidence bound procedure for this setting and prove that it naturally adapts to the sequence of available approximations and costs, thus attaining better regret than naive strategies which ignore the approximations. For instance, in the above online advertising example, MF-UCB would use the lower fidelities to quickly eliminate suboptimal ads and reserve the larger expensive experiments for a small set of promising candidates. We complement this result with a lower bound and show that MF-UCB is nearly optimal under certain conditions.

1 Introduction

Since the seminal work of Robbins [11], the multi-armed bandit has become an attractive framework for studying exploration-exploitation trade-offs inherent to tasks arising in online advertising, finance and other fields. In the most basic form of the K-armed bandit [9, 12], we have a set K = {1, . . . , K} of K arms (e.g. K ads in online advertising). At each time step t = 1, 2, . . . , an arm is played and a corresponding reward is realised. The goal is to design a strategy of plays that minimises the regret after n plays. The regret is the comparison, in expectation, of the realised reward against an oracle that always plays the best arm. 
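Before introducing fidelities, it helps to see the classical loop concretely. The following is a minimal, illustrative simulation of the K-armed setting above with a standard UCB-type index; the Gaussian noise, the constant ρ = 2, and all names are our choices for the sketch, not from the paper:

```python
import math
import random

def ucb(means, n_rounds, rho=2.0, sigma=0.1, seed=0):
    """Illustrative UCB simulation: play each arm once, then play the arm
    maximising  empirical mean + sqrt(rho * log(t) / plays).
    The pseudo-regret accumulates the gap to the best arm's mean."""
    rng = random.Random(seed)
    K = len(means)
    plays, sums = [0] * K, [0.0] * K
    best, pseudo_regret = max(means), 0.0
    for t in range(1, n_rounds + 1):
        if t <= K:
            k = t - 1  # initialisation: one play per arm
        else:
            k = max(range(K), key=lambda i: sums[i] / plays[i]
                    + math.sqrt(rho * math.log(t) / plays[i]))
        plays[k] += 1
        sums[k] += means[k] + rng.gauss(0.0, sigma)  # sub-Gaussian reward
        pseudo_regret += best - means[k]             # oracle comparison
    return pseudo_regret, plays
```

Running this on a few arms shows the familiar behaviour the paper builds on: suboptimal arms receive only O(log n) plays, so the cumulative pseudo-regret grows logarithmically rather than linearly.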
The well known Upper Confidence Bound (UCB) algorithm [3] achieves regret O(K log(n)) after n plays (ignoring mean rewards) and is minimax optimal [9].
In this paper, we propose a new take on this important problem. In many practical scenarios of interest, one can associate a cost to playing each arm. Furthermore, in many of these scenarios, one might have access to cheaper approximations to the outcome of the arms. For instance, in online advertising the goal is to maximise the cumulative number of clicks over a given time period. Conventionally, an arm pull may be thought of as the display of an ad for a specific time, say one hour. However, we may approximate its hourly performance by displaying the ad for shorter periods. This estimate is biased (and possibly noisy), as displaying an ad for longer intervals changes user behaviour. It can nonetheless be useful in gauging the long run click through rate. We can also obtain biased estimates of an ad by displaying it only to certain geographic regions or age groups. Similarly, one might consider algorithm selection for machine learning problems [4], where the goal is to be competitive with the best among a set of learning algorithms for a task. Here, one might obtain cheaper approximate estimates of the performance of an algorithm via versions using less data or computation. In this paper, we will refer to such approximations as fidelities. Consider a 2-fidelity problem where the cost at the low fidelity is λ^{(1)} and the cost at the high fidelity is λ^{(2)}. We will present a cost weighted notion of regret for this setting for a strategy that expends a capital of Λ units.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

A classical K-armed bandit strategy such as UCB, which only uses the highest fidelity, can obtain at best O(λ^{(2)} K log(Λ/λ^{(2)})) regret [9]. In contrast, this paper will present multi-fidelity strategies that achieve O((λ^{(1)} K + λ^{(2)} |K_g|) log(Λ/λ^{(2)})) regret. Here K_g is a (typically) small subset of arms with high expected reward that can be identified using plays at the (cheaper) low fidelity. When |K_g| < K and λ^{(1)} < λ^{(2)}, such a strategy will outperform the more standard UCB algorithms. Intuitively, this is achieved by using the lower fidelities to eliminate several "bad" arms and reserving expensive higher fidelity plays for a small subset of the most promising arms. We formalise the above intuitions in the sequel. Our main contributions are,
1. A novel formalism for studying bandit tasks when one has access to multiple fidelities for each arm, with each successive fidelity providing a better approximation to the most expensive one.
2. A new algorithm that we call Multi-Fidelity Upper Confidence Bound (MF-UCB) that adapts the classical Upper Confidence Bound (UCB) strategies to our multi-fidelity setting. Empirically, we demonstrate that our algorithm outperforms naive UCB on simulations.
3. A theoretical characterisation of the performance of MF-UCB that shows that the algorithm (a) uses the lower fidelities to explore all arms and eliminates arms with low expected reward, and (b) reserves the higher fidelity plays for arms with rewards close to the optimal value. We derive a lower bound on the regret and demonstrate that MF-UCB is near-optimal on this problem.

Related Work
The K-armed bandit has been studied extensively in the past [1, 9, 11]. There has been a flurry of work on upper confidence bound (UCB) methods [2, 3], which adopt the optimism in the face of uncertainty principle for bandits. For readers unfamiliar with UCB methods, we recommend Chapter 2 of Bubeck and Cesa-Bianchi [5]. 
Our work in this paper builds on UCB ideas, but the multi-fidelity framework poses significant new algorithmic and theoretical challenges.
There has been some interest in multi-fidelity methods for optimisation in many applied domains of research [7, 10]. However, these works do not formalise or analyse notions of regret in the multi-fidelity setting. Multi-fidelity methods are used in the robotics community for reinforcement learning tasks by modeling each fidelity as a Markov decision process [6]. Zhang and Chaudhuri [16] study active learning with a cheap weak labeler and an expensive strong labeler. The objective of these papers however is not to handle the exploration-exploitation trade-off inherent to the bandit setting. A line of work on budgeted multi-armed bandits [13, 15] studies a variant of the K-armed bandit where each arm has a random reward and cost and the goal is to play the arm with the highest reward/cost ratio as much as possible. This is different from our setting where each arm has multiple fidelities which serve as approximations. Recently, in Kandasamy et al. [8] we extended ideas in this work to analyse multi-fidelity bandits with Gaussian process payoffs.

2 The Stochastic K-armed Multi-fidelity Bandit

In the classical K-armed bandit, each arm k ∈ K = {1, . . . , K} is associated with a real valued distribution θ_k with mean µ_k. Let K⋆ = argmax_{k∈K} µ_k be the set of optimal arms, k⋆ ∈ K⋆ be an optimal arm and µ⋆ = µ_{k⋆} denote the optimal mean value. A bandit strategy would play an arm I_t ∈ K at each time step t and observe a sample from θ_{I_t}. Its goal is to maximise the sum of expected rewards after n time steps, ∑_{t=1}^n µ_{I_t}, or equivalently minimise the cumulative pseudo-regret ∑_{t=1}^n (µ⋆ − µ_{I_t}) for all values of n. 
In other words, the objective is to be competitive, in expectation, against an oracle that plays an optimal arm all the time.
In this work we differ from the usual bandit setting in the following aspect. For each arm k, we have access to M − 1 successively approximate distributions θ_k^{(1)}, θ_k^{(2)}, . . . , θ_k^{(M−1)} to the desired distribution θ_k^{(M)} = θ_k. We will refer to these approximations as fidelities. Clearly, these approximations are meaningful only if they give us some information about θ_k^{(M)}. In what follows, we will assume that the mth fidelity mean of an arm is within ζ^{(m)}, a known quantity, of its highest fidelity mean, where the ζ^{(m)}, decreasing with m, characterise the successive approximations. That is, |µ_k^{(M)} − µ_k^{(m)}| ≤ ζ^{(m)} for all k ∈ K and m = 1, . . . , M, where ζ^{(1)} > ζ^{(2)} > · · · > ζ^{(M)} = 0 and the ζ^{(m)}'s are known. It is possible for the lower fidelities to be misleading under this assumption: there could exist an arm k with µ_k^{(M)} < µ⋆ = µ_{k⋆}^{(M)} but with µ_k^{(m)} > µ⋆ and/or µ_k^{(m)} > µ_{k⋆}^{(m)} for any m < M. In other words, we wish to explicitly account for the biases introduced by the lower fidelities, and not treat them as just a higher variance observation of an expensive experiment. This problem of course becomes interesting only when lower fidelities are more attractive than higher fidelities in terms of some notion of cost. Towards this end, we will assign a cost λ^{(m)} (such as advertising time, money etc.) to playing an arm at fidelity m, where λ^{(1)} < λ^{(2)} < · · · < λ^{(M)}.
Notation: T_{k,t}^{(m)} denotes the number of plays at arm k, at fidelity m, until t time steps. T_{k,t}^{(>m)} is the number of plays at fidelities greater than m. Q_t^{(m)} = ∑_{k∈K} T_{k,t}^{(m)} is the number of fidelity m plays at all arms until time t. X̄_{k,s}^{(m)} denotes the mean of s samples drawn from θ_k^{(m)}. Denote ∆_k^{(m)} = µ⋆ − µ_k^{(m)} − ζ^{(m)}. When s refers to the number of plays of an arm, we will take 1/s = ∞ if s = 0. Ā denotes the complement of a set A ⊂ K. While discussing the intuitions in our proofs and theorems we will use ≍, ≲, ≳ to denote equality and inequalities that ignore constants.
Regret in the multi-fidelity setting: A strategy for a multi-fidelity bandit problem, at time t, produces an arm-fidelity pair (I_t, m_t), where I_t ∈ K and m_t ∈ {1, . . . , M}, and observes a sample X_t drawn (independently of everything else) from the distribution θ_{I_t}^{(m_t)}. The choice of (I_t, m_t) could depend on previous arm-observation-fidelity tuples {(I_i, X_i, m_i)}_{i=1}^{t−1}. The multi-fidelity setting calls for a new notion of regret. For any strategy A that expends Λ units of the resource, we will define the pseudo-regret R(Λ, A) as follows. Let q_t denote the instantaneous pseudo-reward at time t and r_t = µ⋆ − q_t denote the instantaneous pseudo-regret. We will discuss choices for q_t shortly. Any notion of regret in the multi-fidelity setting needs to account for this instantaneous regret along with the cost of the fidelity at which we played at time t, i.e. λ^{(m_t)}. Moreover, we should receive no reward (maximum regret) for any unused capital. 
These observations lead to the following definition,

R(Λ, A) = Λµ⋆ − ∑_{t=1}^N λ^{(m_t)} q_t = ∑_{t=1}^N λ^{(m_t)} r_t + µ⋆ (Λ − ∑_{t=1}^N λ^{(m_t)}),   (1)

where the first term on the right hand side is R̃(Λ, A) and the second is r̃(Λ, A). Above, N is the (random) number of plays within capital Λ by A, i.e. the largest n such that ∑_{t=1}^n λ^{(m_t)} ≤ Λ. To motivate our choice of q_t we consider an online advertising example where λ^{(m)} is the advertising time at fidelity m and µ_k^{(m)} is the expected number of clicks per unit time. While we observe from θ_{I_t}^{(m_t)} at time t, we wish to reward the strategy according to its highest fidelity distribution θ_{I_t}^{(M)}. Therefore, regardless of which fidelity we play, we set q_t = µ_{I_t}^{(M)}. Here, we are competing against an oracle which plays an optimal arm at any fidelity all the time. Note that we might have chosen q_t to be µ_{I_t}^{(m_t)}. However, this does not reflect the motivating applications for the multi-fidelity setting that we consider. For instance, a clickbait ad might receive a high number of clicks in the short run, but its long term performance might be poor. Furthermore, for such a choice, we may as well ignore the rich structure inherent to the multi-fidelity setting and simply play the arm argmax_{m,k} µ_k^{(m)} at each time. There are of course other choices for q_t that result in very different notions of regret; we discuss this briefly at the end of Section 7.
The distributions θ_k^{(m)} need to be well behaved for the problem to be tractable. We will assume that they satisfy concentration inequalities of the following form. For all ε > 0,

∀ m, k,   P(X̄_{k,s}^{(m)} − µ_k^{(m)} > ε) < ν e^{−s ψ(ε)},   P(X̄_{k,s}^{(m)} − µ_k^{(m)} < −ε) < ν e^{−s ψ(ε)}.   (2)

Here ν > 0 and ψ is an increasing function with ψ(0) = 0 which is at least linearly increasing, ψ(x) ∈ Ω(x). For example, if the distributions are sub-Gaussian, then ψ(x) ∈ Θ(x²).
The performance of a multi-fidelity strategy which switches from low to high fidelities can be worsened by artificially inserting fidelities. Consider a scenario where λ^{(m+1)} is only slightly larger than λ^{(m)} and ζ^{(m+1)} is only slightly smaller than ζ^{(m)}. This situation is unfavourable since there isn't much that can be inferred from the (m + 1)th fidelity that cannot already be inferred from the mth by expending the same cost. We impose the following regularity condition to avoid such situations.
Assumption 1. The ζ^{(m)}'s decay fast enough such that ∑_{i=1}^m 1/ψ(ζ^{(i)}) ≤ 1/ψ(ζ^{(m+1)}) for all m < M.
Assumption 1 is not necessary to analyse our algorithm; however, the performance of MF-UCB when compared to UCB is most appealing when the above holds. In cases where M is small enough and can be treated as a constant, the assumption is not necessary. For sub-Gaussian distributions, the condition is satisfied for an exponentially decaying (ζ^{(1)}, ζ^{(2)}, . . . ) such as (1/√2, 1/2, 1/(2√2), . . . ).
Our goal is to design a strategy A₀ that has low expected pseudo-regret E[R(Λ, A₀)] for all values of (sufficiently large) Λ, i.e. the equivalent of an anytime strategy, as opposed to a fixed time horizon strategy, in the usual bandit setting. The expectation is over the observed rewards which also dictates the number of plays N. 
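Assumption 1 is easy to check numerically. The snippet below is our illustration (not from the paper), taking ψ(x) = x² for the sub-Gaussian case; it confirms that the exponentially decaying sequence above satisfies the condition while a slowly decaying one does not:

```python
def satisfies_assumption1(zetas, psi):
    """Check Assumption 1:  sum_{i=1}^m 1/psi(zeta^(i)) <= 1/psi(zeta^(m+1))
    for all m < M, for a given decreasing sequence of zetas."""
    return all(
        sum(1.0 / psi(z) for z in zetas[:m]) <= 1.0 / psi(zetas[m])
        for m in range(1, len(zetas)))

# (1/sqrt(2), 1/2, 1/(2*sqrt(2)), ...): with psi(x) = x^2 the reciprocals
# 1/psi(zeta^(i)) = 2^i sum geometrically, so the condition holds.
geometric = [2.0 ** (-(i + 1) / 2.0) for i in range(4)]
slow = [0.5, 0.45, 0.4, 0.35]  # decays too slowly; the condition fails
```

This mirrors the remark above: for ψ(x) = x² the sum ∑_{i=1}^m 2^i = 2^{m+1} − 2 ≤ 2^{m+1} = 1/ψ(ζ^{(m+1)}) for the geometric sequence.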
From now on, for simplicity we will write R(Λ) when A is clear from context and refer to it just as regret.

3 The Multi-Fidelity Upper Confidence Bound (MF-UCB) Algorithm

As the name suggests, the MF-UCB algorithm maintains an upper confidence bound corresponding to µ_k^{(m)} for each m ∈ {1, . . . , M} and k ∈ K based on its previous plays. Following UCB strategies [2, 3], we define the following set of upper confidence bounds,

B_{k,t}^{(m)}(s) = X̄_{k,s}^{(m)} + ψ^{−1}(ρ log t / s) + ζ^{(m)}   for all m ∈ {1, . . . , M}, k ∈ K,
B_{k,t} = min_{m=1,...,M} B_{k,t}^{(m)}(T_{k,t−1}^{(m)}).   (3)

Here ρ is a parameter in our algorithm and ψ is from (2). Each B_{k,t}^{(m)}(T_{k,t−1}^{(m)}) provides a high probability upper bound on µ_k^{(M)}, with their minimum B_{k,t} giving the tightest bound (see Appendix A). Similar to UCB, at time t we play the arm I_t with the highest upper bound, I_t = argmax_{k∈K} B_{k,t}.
Since our setup has multiple fidelities associated with each arm, the algorithm needs to determine at each time t which fidelity (m_t) to play the chosen arm (I_t) at. For this, consider an arbitrary fidelity m < M. The ζ^{(m)} conditions on µ_k^{(m)} imply a constraint on the value of µ_k^{(M)}. If, at fidelity m, the uncertainty interval ψ^{−1}(ρ log(t)/T_{k,t}^{(m)}) is large, then we have not constrained µ_k^{(M)} sufficiently well yet. There is more information to be gleaned about µ_k^{(M)} from playing the arm I_t at fidelity m. On the other hand, playing at fidelity m indefinitely will not help us much since the ζ^{(m)} elongation of the confidence band caps off how much we can learn about µ_k^{(M)} from fidelity m; i.e. even if we knew µ_{I_t}^{(m)}, we will have only constrained µ_{I_t}^{(M)} to within a ±ζ^{(m)} interval. Our algorithm captures this natural intuition. Having selected I_t, we begin checking at the first fidelity. If ψ^{−1}(ρ log(t)/T_{I_t,t−1}^{(1)}) is smaller than a threshold γ^{(1)} we proceed to check the second fidelity, continuing in a similar fashion. If at any point ψ^{−1}(ρ log(t)/T_{I_t,t−1}^{(m)}) ≥ γ^{(m)}, we play I_t at fidelity m_t = m. If we go all the way to fidelity M, we play at m_t = M. The resulting procedure is summarised below in Algorithm 1.

Algorithm 1 MF-UCB
• for t = 1, 2, . . .
  1. Choose I_t ∈ argmax_{k∈K} B_{k,t}.   (See equation (3).)
  2. m_t = min_m { m | ψ^{−1}(ρ log t / T_{I_t,t−1}^{(m)}) ≥ γ^{(m)} ∨ m = M }   (See equation (4).)
  3. Play X ∼ θ_{I_t}^{(m_t)}.

Choice of γ^{(m)}: In our algorithm, we choose

γ^{(m)} = ψ^{−1}( (λ^{(m)}/λ^{(m+1)}) ψ(ζ^{(m)}) ).   (4)

To motivate this choice, note that if ∆_k^{(m)} = µ⋆ − µ_k^{(m)} − ζ^{(m)} > 0 then we can conclude that arm k is not optimal. Step 2 of the algorithm attempts to eliminate arms for which ∆_k^{(m)} ≳ γ^{(m)} from plays above the mth fidelity. If γ^{(m)} is too large, then we would not eliminate a sufficient number of arms, whereas if it was too small we could end up playing a suboptimal arm k (for which µ_k^{(m)} > µ⋆) too many times at fidelity m. As will be revealed by our analysis, the given choice represents an optimal tradeoff under the given assumptions.

Figure 1: Illustration of the partitions K^{(m)} for a M = 4 fidelity problem. The sets J^{(m)}_{ζ^{(m)}+2γ^{(m)}} are indicated next to their boundaries. K^{(1)}, K^{(2)}, K^{(3)}, K^{(4)} are shown in yellow, green, red and purple respectively. 
The optimal arms K⋆ are shown as a black circle.

4 Analysis

We will be primarily concerned with the term R̃(Λ, A) = R̃(Λ) from (1). r̃(Λ, A) is a residual term; it is an artefact of the fact that after the (N + 1)th play, the spent capital would have exceeded Λ. For any algorithm that operates oblivious to a fixed capital, it can be bounded by λ^{(M)} µ⋆, which is negligible compared to R̃(Λ). According to the above, we have the following expression for R̃(Λ):

R̃(Λ) = ∑_{k∈K} ∆_k^{(M)} ( ∑_{m=1}^M λ^{(m)} T_{k,N}^{(m)} ).   (5)

Central to our analysis will be the following partitioning of K. First denote the set of arms whose fidelity m mean is within η of µ⋆ to be J_η^{(m)} = {k ∈ K ; µ⋆ − µ_k^{(m)} ≤ η}. Define K^{(1)} ≜ J̄^{(1)}_{ζ^{(1)}+2γ^{(1)}} = {k ∈ K ; ∆_k^{(1)} > 2γ^{(1)}} to be the arms whose first fidelity mean µ_k^{(1)} is at least ζ^{(1)} + 2γ^{(1)} below the optimum µ⋆. Then we recursively define,

K^{(m)} ≜ J̄^{(m)}_{ζ^{(m)}+2γ^{(m)}} ∩ ( ∩_{ℓ=1}^{m−1} J^{(ℓ)}_{ζ^{(ℓ)}+2γ^{(ℓ)}} ),  ∀ m ≤ M − 1,   K^{(M)} ≜ K̄⋆ ∩ ( ∩_{ℓ=1}^{M−1} J^{(ℓ)}_{ζ^{(ℓ)}+2γ^{(ℓ)}} ).

Observe that for all k ∈ K^{(m)}, ∆_k^{(m)} > 2γ^{(m)} and ∆_k^{(ℓ)} ≤ 2γ^{(ℓ)} for all ℓ < m. For what follows, for any k ∈ K, ⟦k⟧ will denote the partition k belongs to, i.e. ⟦k⟧ = m s.t. k ∈ K^{(m)}. We will see that K^{(m)} are the arms that will be played at the mth fidelity but can be excluded from fidelities higher than m using information at fidelity m. See Fig.
1 for an illustration of these partitions.

4.1 Regret Bound for MF-UCB

Recall that N = ∑_{m=1}^M Q_N^{(m)} is the total (random) number of plays by a multi-fidelity strategy within capital Λ. Let n_Λ = ⌊Λ/λ^{(M)}⌋ be the (non-random) number of plays by any strategy that operates only on the highest fidelity. Since λ^{(m)} < λ^{(M)} for all m < M, N could be large for an arbitrary multi-fidelity method. However, our analysis reveals that for MF-UCB, N ≲ n_Λ with high probability. The following theorem bounds R for MF-UCB. The proof is given in Appendix A. For clarity, we ignore the constants but they are fleshed out in the proofs.
Theorem 2 (Regret Bound for MF-UCB). Let ρ > 4. There exists Λ₀ depending on the λ^{(m)}'s such that for all Λ > Λ₀, MF-UCB satisfies,

E[R(Λ)]/log(n_Λ) ≲ ∑_{k∉K⋆} ∆_k^{(M)} · λ^{(⟦k⟧)}/ψ(∆_k^{(⟦k⟧)}) ≍ ∑_{m=1}^M ∑_{k∈K^{(m)}} ∆_k^{(M)} · λ^{(m)}/ψ(∆_k^{(m)}).

Let us compare the above bound to UCB, whose regret is E[R(Λ)]/log(n_Λ) ≍ ∑_{k∉K⋆} ∆_k^{(M)} · λ^{(M)}/ψ(∆_k^{(M)}). We will first argue that MF-UCB does not do significantly worse than UCB in the worst case. Modulo the log(n_Λ) terms, the regret for MF-UCB due to arm k is R_{k,MF-UCB} ≍ λ^{(⟦k⟧)}/ψ(∆_k^{(⟦k⟧)}). Consider any k ∈ K^{(m)}, m < M, for which ∆_k^{(m)} > 2γ^{(m)}. Since ∆_k^{(M)} ≤ ∆_k^{(⟦k⟧)} + 2ζ^{(⟦k⟧)} ≲ ψ^{−1}( (λ^{(⟦k⟧+1)}/λ^{(⟦k⟧)}) ψ(∆_k^{(⟦k⟧)}) ), a (loose) lower bound for UCB for the same quantity is R_{k,UCB} ≍ λ^{(M)}/ψ(∆_k^{(M)}) ≳ (λ^{(M)}/λ^{(⟦k⟧+1)}) R_{k,MF-UCB}. Therefore for any k ∈ K^{(m)}, m < M, MF-UCB is at most a constant times worse than UCB. However, whenever ∆_k^{(⟦k⟧)} is comparable to or larger than ∆_k^{(M)}, MF-UCB outperforms UCB by a factor of λ^{(⟦k⟧)}/λ^{(M)} on arm k. As can be inferred from the theorem, most of the cost invested by MF-UCB on arm k is at the ⟦k⟧th fidelity. For example, in Fig. 1, MF-UCB would not play the yellow arms K^{(1)} beyond the first fidelity (more than a constant number of times). Similarly, all green and red arms are played mostly at the second and third fidelities respectively. Only the purple arms are played at the fourth (most expensive) fidelity. On the other hand, UCB plays all arms at the fourth fidelity. Since lower fidelities are cheaper, MF-UCB achieves better regret than UCB.
It is essential to note here that ∆_k^{(M)} is small for arms in K^{(M)}. These arms are close to the optimum and require more effort to distinguish than arms that are far away. MF-UCB, like UCB, invests log(n_Λ) λ^{(M)}/ψ(∆_k^{(M)}) capital in those arms. That is, the multi-fidelity setting does not help us significantly with the "hard-to-distinguish" arms. That said, in cases where K is very large and the
That said, in cases where K is very large and the\nsets K(M ) is small the bound for MF-UCB can be appreciably better than UCB.\n4.2 Lower Bound\nSince, N \u2265 n\u039b = (cid:98)\u039b/\u03bb(M )(cid:99), any multi-\ufb01delity strategy which plays a suboptimal arm a polynomial\nnumber of times at any \ufb01delity after n time steps, will have worse regret than MF-UCB (and UCB).\nTherefore, in our lower bound we will only consider strategies which satisfy the following condition.\nAssumption 3. Consider the strategy after n plays at any \ufb01delity. For any arm with \u2206(M )\nk > 0, we\n\nk\n\nk\n\nhave E[(cid:80)M\n\nm=1 T (m)\n\nk,n ] \u2208 o(na) for any a > 0 .\n\nk\n\nfor each \ufb01delity m and\n. It is known that for Bernoulli distributions \u03c8(\u0001) \u2208 \u0398(\u00012) [14]. To state\n\nFor our lower bound we will consider a set of Bernoulli distributions \u03b8(m)\neach arm k with mean \u00b5(m)\nour lower bound we will further partition the set K(m) into two sets K(m)\n\u0013 = {k \u2208 K(m) : \u2206((cid:96))\nK(m)\nk > 0}.\nFor any k \u2208 K(m) our lower bound, given below, is different depending on which set k belongs to.\nTheorem 4 (Lower bound for R(\u039b)). Consider any set of Bernoulli reward distributions with\n\u00b5(cid:63) \u2208 (1/2, 1) and \u03b6 (1) < 1/2. Then, for any strategy satisfying Assumption 3 the following holds.\n\nas follows,\nK(m)\n\u0017 = {k \u2208 K(m) : \u2203 (cid:96) < m s.t. \u2206((cid:96))\n\nk \u2264 0 \u2200(cid:96) < m},\n\n\u0013 ,K(m)\n\nk\n\n\u0017\n\nE[R(\u039b)]\nlog(n\u039b) \u2265 c \u00b7\n\nlim inf\n\u039b\u2192\u221e\n\n\u2206(M )\n\nk\n\n\u03bb(m)\n\u2206(m)\n\n2 +\n\n\u2206(M )\n\nk\n\nmin\n\n(cid:96)\u2208Lm(k)\n\n\u03bb((cid:96))\n2\n\u2206((cid:96))\nk\n\n(6)\n\n\u0017\n\n\u0017\n\n\u0017\n\nk\n\nk\u2208K(m)\n\nk > 0.\n\nk < 0 since \u2206((cid:96))\n\n, there exists some (cid:96) < m such that 0 < \u2206((cid:96))\n\nk\u2208K(m)\nHere c is a problem dependent constant. 
L_m(k) = {ℓ < m : ∆_k^{(ℓ)} > 0} ∪ {m} is the union of the mth fidelity and all fidelities smaller than m for which ∆_k^{(ℓ)} > 0.
Comparing this with Theorem 2 we find that MF-UCB meets the lower bound on all arms k ∈ K^{(m)}_✓, ∀ m. However, it may be loose on any k ∈ K^{(m)}_✗. The gap can be explained as follows. For k ∈ K^{(m)}_✗, there exists some ℓ < m such that 0 < ∆_k^{(ℓ)} < 2γ^{(ℓ)}. As explained previously, the switching criterion of MF-UCB ensures that we do not invest too much effort trying to distinguish whether ∆_k^{(ℓ)} > 0 or ∆_k^{(ℓ)} < 0, since ∆_k^{(ℓ)} could be very small. That is, we proceed to the next fidelity only if we cannot conclude that ∆_k^{(ℓ)} is large, i.e. when ∆_k^{(ℓ)} ≲ γ^{(ℓ)}. However, since λ^{(m)} > λ^{(ℓ)} it might be the case that λ^{(ℓ)}/∆_k^{(ℓ)2} < λ^{(m)}/∆_k^{(m)2} even though ∆_k^{(m)} > 2γ^{(m)}. Consider for example a two fidelity problem where ∆ = ∆_k^{(1)} = ∆_k^{(2)} < 2 √(λ^{(1)}/λ^{(2)}) ζ^{(1)}. Here it makes sense to distinguish the arm as being suboptimal at the first fidelity with λ^{(1)} log(n_Λ)/∆² capital instead of λ^{(2)} log(n_Λ)/∆² at the second fidelity. However, MF-UCB distinguishes this arm at the higher fidelity as ∆ < 2γ^{(m)} and therefore does not meet the lower bound on this arm. While it might seem tempting to switch based on estimates for ∆_k^{(1)}, this idea is not desirable as estimating ∆_k^{(2)} for an arm requires log(n_Λ)/ψ(∆_k^{(2)}) samples at the second fidelity; this is exactly what we are trying to avoid for the majority of the arms via the multi-fidelity setting. We leave it as an open problem to resolve this gap.

                    k ∈ K^{(1)}            k ∈ K^{(2)}            . . .   k ∈ K^{(m)}            . . .   k ∈ K^{(M)}            k ∈ K⋆
E[T^{(1)}_{k,n}]:   log(n)/ψ(∆^{(1)}_k)    log(n)/ψ(γ^{(1)})      . . .   log(n)/ψ(γ^{(1)})      . . .   log(n)/ψ(γ^{(1)})      log(n)/ψ(γ^{(1)})
E[T^{(2)}_{k,n}]:   O(1)                   log(n)/ψ(∆^{(2)}_k)    . . .   log(n)/ψ(γ^{(2)})      . . .   log(n)/ψ(γ^{(2)})      log(n)/ψ(γ^{(2)})
  ⋮
E[T^{(m)}_{k,n}]:   O(1)                   O(1)                   . . .   log(n)/ψ(∆^{(m)}_k)    . . .   log(n)/ψ(γ^{(m)})      log(n)/ψ(γ^{(m)})
  ⋮
E[T^{(M)}_{k,n}]:   O(1)                   O(1)                   . . .   O(1)                   . . .   log(n)/ψ(∆^{(M)}_k)    Ω(n)

Table 1: Bounds on the expected number of plays for each k ∈ K^{(m)} (columns) at each fidelity (rows) after n time steps (i.e. n plays at any fidelity) in MF-UCB.

5 Proof Sketches

5.1 Theorem 2

First we analyse MF-UCB after n plays (at any fidelity) and control the number of plays of an arm at various fidelities depending on which K^{(m)} it belongs to. To that end we prove the following.
Lemma 5. (Bounding E[T^{(m)}_{k,n}] -- Informal) After n time steps of MF-UCB, for any k ∈ K,

E[T^{(ℓ)}_{k,n}] ≲ log(n)/ψ(γ^{(ℓ)}) ∀ ℓ < ⟦k⟧,   E[T^{(⟦k⟧)}_{k,n}] ≲ log(n)/ψ(∆^{(⟦k⟧)}_k /2),   E[T^{(>⟦k⟧)}_{k,n}] ≤ O(1).

The bounds above are illustrated in Table 1. Let R̃_k(Λ) = ∑_{m=1}^M λ^{(m)} ∆^{(M)}_k T^{(m)}_{k,N} be the regret incurred due to arm k and R̃_{kn} = E[R̃_k(Λ) | N = n]. 
Using Lemma 5 we have,

R̃_{kn} ≲ ∆_k^{(M)} log(n) ( ∑_{ℓ=1}^{⟦k⟧−1} λ^{(ℓ)}/ψ(γ^{(ℓ)}) + λ^{(⟦k⟧)}/ψ(∆_k^{(⟦k⟧)}/2) ) + o(1).   (7)

The next step will be to control the number of plays N within capital Λ, which will bound E[log(N)]. While Λ/λ^{(1)} is an easy bound, we will see that for MF-UCB, N will be on the order of n_Λ = Λ/λ^{(M)}. For this we will use the following high probability bounds on T^{(m)}_{k,n}.
Lemma 6. (Bounding P(T^{(m)}_{k,n} > · ) -- Informal) After n time steps of MF-UCB, for any k ∈ K,

P( T^{(⟦k⟧)}_{k,n} ≳ x · log(n)/ψ(∆^{(⟦k⟧)}_k /2) ) ≲ 1/(n x^{ρ−1}),   P( T^{(>⟦k⟧)}_{k,n} > x ) ≲ 1/x^{ρ−2}.

We bound the number of plays at fidelities less than M via Lemma 6 and obtain n/2 > ∑_{m=1}^{M−1} Q^{(m)}_n with probability greater than, say, 1 − δ, for all n ≥ n₀. By setting δ = 1/log(Λ/λ^{(1)}), we get E[log(N)] ≲ log(n_Λ). The actual argument is somewhat delicate since δ depends on Λ. This gives us an expression for the regret due to arm k of the form (7), where n is replaced by n_Λ. Then we argue that the regret incurred by an arm k at fidelities less than ⟦k⟧ (first term in the RHS of (7)) is dominated by λ^{(⟦k⟧)}/ψ(∆^{(⟦k⟧)}_k) (second term). This is possible due to the design of the sets K^{(m)} and Assumption 1. While Lemmas 5, 6 require only ρ > 2, we need ρ > 4 to ensure that ∑_{m=1}^{M−1} Q^{(m)}_n remains sublinear when we plug in the probabilities from Lemma 6. ρ > 2 is attainable with a more careful design of the sets K^{(m)}. 
The Λ > Λ0 condition is needed because MF-UCB initially plays at the lower fidelities, and for small Λ, N could be much larger than n_Λ.

5.2 Theorem 4
First we show that for an arm k with Δ(p)_k > 0 and Δ(ℓ)_k ≤ 0 for all ℓ < p, any strategy should satisfy

    R_k(Λ) ≳ log(n_Λ) Δ(M)_k [ min_{ℓ ≥ p, Δ(ℓ)_k > 0} λ(ℓ)/Δ(ℓ)_k² ],

where R_k is the regret incurred due to arm k. The proof uses a change-of-measure argument. The modification has Bernoulli distributions with means μ̃(ℓ)_k, ℓ = 1, ..., M, where μ̃(ℓ)_k = μ(ℓ)_k for all ℓ < m. Then we push μ̃(ℓ)_k slightly above μ⋆ − ζ(ℓ) from ℓ = m all the way to M, where μ̃(M)_k > μ⋆. To control the probabilities after changing to μ̃(ℓ)_k we use the conditions in Assumption 3. Then, for k ∈ K(m), we argue that λ(ℓ)/Δ(ℓ)_k² ≳ λ(m)/Δ(m)_k², using once again the design of the sets K(m). This yields the separate results for the two subsets of K(m) in the theorem.

Figure 2: Simulation results on the synthetic problems. The first four figures compare UCB against MF-UCB on four synthetic problems; each title states K, M and the costs λ(1), ..., λ(M). The first two problems used Gaussian rewards and the last two used Bernoulli rewards. The last two figures show the number of plays by UCB and MF-UCB on a K = 500, M = 3 problem with Gaussian observations (corresponding to the first figure).

6 Some Simulations on Synthetic Problems
We compare UCB against MF-UCB on a series of synthetic problems.
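Before turning to the results, here is a minimal, self-contained sketch of such a comparison. It is not the exact procedure evaluated in the paper: the fidelity-selection rule is a simplified variant, ψ is taken as x², and all problem parameters (K, M, costs, means, ζ(m) and the switch-up thresholds) are invented for illustration. Only the shape of the upper confidence bound — a minimum over fidelities of empirical mean plus confidence width plus bias allowance ζ(m) — follows the construction described above.

```python
import math
import random

random.seed(0)

K, M = 20, 2
lam = [1.0, 10.0]                # costs lambda(1), lambda(2) (assumed)
zeta = [0.2, 0.0]                # bias bounds |mu(M)_k - mu(m)_k| <= zeta(m)
gamma = [0.2]                    # switch-up thresholds for fidelities 1..M-1
mu2 = [0.1 + 0.8 * k / (K - 1) for k in range(K)]   # top-fidelity means
mu1 = [min(1.0, x + 0.1) for x in mu2]              # biased cheap approximations
mu = [mu1, mu2]
best = max(mu2)

def run(multi_fidelity, capital=20000.0):
    counts = [[0] * K for _ in range(M)]
    sums = [[0.0] * K for _ in range(M)]
    spent, regret, t = 0.0, 0.0, 0
    while spent < capital:
        t += 1

        def ucb(k):
            # Min over played fidelities of mean + width + zeta(m);
            # an arm never played anywhere is maximally optimistic.
            fids = range(M) if multi_fidelity else [M - 1]
            vals = [sums[m][k] / counts[m][k]
                    + math.sqrt(2 * math.log(t + 1) / counts[m][k]) + zeta[m]
                    for m in fids if counts[m][k] > 0]
            return min(vals) if vals else float("inf")

        k = max(range(K), key=ucb)
        # Simplified fidelity choice: stay at a cheap fidelity while its
        # confidence width still exceeds gamma(m); otherwise move up.
        m = M - 1
        if multi_fidelity:
            for mm in range(M - 1):
                if counts[mm][k] == 0 or math.sqrt(
                        2 * math.log(t + 1) / counts[mm][k]) >= gamma[mm]:
                    m = mm
                    break
        counts[m][k] += 1
        sums[m][k] += mu[m][k] + random.gauss(0.0, 0.1)
        spent += lam[m]
        # Per-arm regret as in R~_k: cost of the play times Delta(M)_k.
        regret += lam[m] * (best - mu2[k])
    return regret

print("MF-UCB-style:", run(True), " UCB:", run(False))
```

This is only a sketch of the comparison; the actual experimental details are as specified in Appendix C.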
The results are given in Figure 2. Due to space constraints, the details of these experiments are given in Appendix C. Note that MF-UCB outperforms UCB on all these problems. Critically, the gradient of its regret curve is also smaller than that of UCB, corroborating our theoretical insights. We have also illustrated the number of plays by MF-UCB and UCB at each fidelity for one of these problems, with the arms arranged in increasing order of their μ(M)_k values. As predicted by our analysis, most of the very suboptimal arms are played only at the lower fidelities. Since lower fidelities are cheaper, MF-UCB is able to invest more high-fidelity plays in arms close to the optimum than UCB.

7 Conclusion
We studied a novel framework for exploration-exploitation trade-offs when cheaper approximations to a desired experiment are available. We proposed an algorithm for this setting, MF-UCB, based on upper confidence bound techniques. It uses the cheap lower-fidelity plays to eliminate several bad arms and reserves the expensive high-fidelity queries for a small set of arms with high expected reward, hence achieving better regret than strategies which ignore multi-fidelity information. We complement this result with a lower bound which demonstrates that MF-UCB is near optimal.
Other settings for bandit problems with multi-fidelity evaluations might warrant different definitions of the regret. For example, consider a gold-mining robot where each high-fidelity play is a real-world experiment with the robot and incurs cost λ(2), while a vastly cheaper computer simulation, which incurs cost λ(1), approximates the robot's real-world behaviour. In applications like this, λ(1) ≪ λ(2). However, unlike in our setting, lower-fidelity plays may not yield any reward (as simulations do not produce actual gold).
Similarly, in clinical trials the regret due to a bad treatment at the high fidelity would be, say, a dead patient, whereas a bad treatment at a lower fidelity may not warrant as large a penalty. These settings are quite challenging, and we wish to work on them going forward.

[Figure 2 appears here: regret R(Λ) against spent capital Λ for MF-UCB and UCB on four synthetic problems — K=500, M=3, costs [1;10;100]; K=500, M=4, costs [1;5;20;50]; K=200, M=2, costs [1;10]; K=1000, M=5, costs [1;3;10;30;100] — together with per-arm, per-fidelity play counts for MF-UCB and UCB on the K=500, M=3 problem.]