NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 957 The Multi-fidelity Multi-armed Bandit

### Reviewer 1

#### Summary

The paper models multi-armed bandit problems where there is an 'fidelity' component to be chosen by the learner (from among a finite set of fidelities) for the arm that is being pulled. An arm pulled at a certain fidelity yields a stochastic iid reward with mean within zeta_m of the true mean, where the zeta_m parameters (errors) are known a priori, and where the highest fidelity level corresponds to zeta = 0. In addition, smaller fidelity pulls are cheaper than higher fidelity pulls (i.e., cost less). The authors frame the problem of sequentially pulling arms with fidelities so as to maximize a notion of net reward, give a new algorithm for the problem, and argue that it can outperform UCB operating at only high fidelity levels. The paper also gives a lower bound for the regret of the proposed algorithm.

#### Qualitative Assessment

The paper in my opinion studies an interesting and relevant problem - one of modelling the tradeoff between information, cost and reward (whether to choose low information that is cheap or high information that is expensive) - in online learning , specifically stochastic bandits. In this sense it may be useful as a benchmark to improve upon. Though the paper seems technically solid, a key shortcoming is the lack of adequate explanation about its results and assumptions. The regret definition adopted seems unnatural at least from one angle - why not penalize resource consumption (or 'cost') additively instead of multiplicatively as done here? The authors' example of ad-display motivates their definition, but may not be the most general. What happens when the cost paid for suitable fidelity observations is additively factored in total reward (or regret) is perhaps a more natural question, and would be good to know. The introduction of Assumption 1 seems rather artificial/opaque, involving constraints on the fidelity parameters given to the learner. Why would the authors expect this to hold in a general learning problem? More importantly, if the zeta_m's don't decay as fast as required by Assumption 1, is there a general way of selecting a subsequence of fidelities to satisfy the assumption and hence the results (this would significantly help strengthen the paper's message)? Though the authors mention that the assumption is not crucial, it seems to be used in the proofs, and my main concern is whether the assumption influences the algorithm's design and performance in a critical fashion. It appears that the algorithm's design is tailored towards performing well for some kinds of zeta configurations, but the justification of why these are the 'right' zeta sequences is missing. Again, the gap in the lower bound also demands more investigation and perhaps a re-design of the proposed strategy. Interpreting the main result (Thm. 2, the regret bound for MF-UCB) and contrasting with UCB (at highest fidelity? (this is not explicitly stated in the discussion)) is not easy to grasp due to the highly technical nature of the presentation. The relative gain/loss of MF-UCB seems to depend on Delta and [[k]] which in turn depend on the rather complicated sequence of sets \cal{K}^(m). A more explicit explanation would greatly help the authors' case. Overall, I feel the paper gets off to a good start in terms of considering an interesting learning model, but rapidly loses clarity when it comes to presenting the results. Minor typos: l 56: "near-optimal" l 126: period unnecessary before "for all m < M" l 179: set notation (curly braces) would be better to define [[k]]

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

This work introduces a new multi arm bandit setting where each arm can be drawn with different accuracies (or fidelities). The cost of pulling an arm increases with the desired accuracy, and the regret is defined as the product of the cost by the usual pseudo regret. The authors propose an algorithm which address this setting, MF-UCB, and provide an upper bound and a lower bound on the cumulative pseudo-regret of this algorithm.

#### Qualitative Assessment

The main paper is clear and well written. However this is not the case of the supplementary materials, where most of the proofs are. they contain many typos and small, easily correctable errors which can confuse the reader (see below for a non exhaustive list). The correction of those errors would greatly increase the readability of the paper. Additionally, I think the up-to-a-multiplicative-constant signs used in the paper can be a bit misleading, as constants are not really constants : for instance in Theorem 2, there is a factor \rho (a parameter of the algorithm) which is hidden in the constant. Using the exact inequalities (which are written in the supplementary materials) could improve the clarity of the results. From a practical point of view, it is not clear to me that this algorithm can easily be applied to real life problems. For instance the strategy of MF-UCB (using low accurate pulls to eliminate greatly suboptimal arms) could suffer from inaccurate estimation of the \zeta (the fidelity accuracy). I think that a discussion on the practical limitation, and how to overcome them, would benefit this work From a theoretical perspective, this paper provides a nice insight into the problem, as well as interesting theorems and solid proofs. Overall I think that this work is a significant and interesting contribution. List of typos/errors in the appendix : - page 10 equation (8) ; k should be k* - Section A1 : the H^(m) should be K^(m) - page 10 line 328 : \nu /(rho-2) -- > \nu/2 - page 11 equation after line 343 : s -- > \mu_* - page 11 equation (10) : \gamma -- > \gamma^(m) - page 11 equation (11) : there is an extra ',' - page 12 line 375 : Lemma -- > Theorem - page 12 last line and page 13 first equation : parentheses are missing - page 13 : in the definition of x_n, K^2 -- > K - page 13 : in the definition of y_n, ^{1/2} -- > ^{1/(\rho-2)} - page 13 line 388-389: The sequence of logical implications seems false to me, I think it simply is a consequence of \rho > 4 - page 13 in equation (15) : x -- > x_{n,\delta}. Same for y and  \delta - page 14 line 406 : N -- > log(N) - page 14 equation after line 406 : N < 2n --> N > 2n - page 14 equation before line 408 : \mu_* should be removed. The additive constants +1 and \kappa_k are missing - page 15 : p and m are interchanged several times. - page 15 , line 432 : the opposite inequality should also be verified - page 15 equation after line 445 : The \tilde are missing - page 15 line 448 : \cup -- > \cap - page 15 equation after line 448 : the ^(l) were dropped - page 16 equation after line 451 : the P() is missing the the rightmost term - page 17 line 454 : theorem -- > lemma - section C : distribtuions -- > distributions

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 3

#### Summary

The paper introduces a "multi-fidelity" version of the multi-armed bandit problem in which there are K arms, each of which can be played at M different levels of "fidelity". Lower fidelity is less costly, but leads to potentially greater bias in the estimate of the arm's reward. (Pulling an arm at fidelity m draws a sample from a distribution whose expected value deviates from the arm's true expected reward by an amount whose upper bound depends on the fidelity.) The paper analyzes the natural UCB-based algorithm for this problem, where the width of the confidential is increased to compensate for the uncertainty stemming from the bias of sampling at low fidelity. The analysis of the algorithm is summarized by theorems containing messy regret bounds that are hard to interpret; the paper argues that this regret bound will often be significantly better than the regret bound attained by always playing arms at the highest fidelity and using the standard UCB algorithm to select which arm to play. A brief section near the end of the paper presents simulation results to substantiate this claim.

#### Qualitative Assessment

I was initially puzzled as to why the paper chooses to model lower fidelity as introducing more bias, rather than introducing more variance. (The latter interpretation of "low fidelity" also seems well motivated to me.) The motivating examples given in the paper (e.g. algorithm selection for machine learning problems) convinced me that modeling low fidelity as greater bias is well motivated, although I'm still not convinced by one of the motivations (namely, the contention that displaying an ad for a shorter time interval yields a biased estimate of the ad's effectiveness over longer intervals) and in the example of algorithm selection for machine learning, it's not clear why minimizing regret would be the goal of an algorithm designer in that application. So to sum up, I thought that there was a disappointingly weak connection between the model in this paper and the motivating applications, but not so weak as to constitute a disqualifying flaw. The main innovation in the paper seems to be the formulation of the multi-fidelity multi-armed bandit problem itself. The algorithm is a fairly typical application of the "optimism in the face of uncertainty" principle. The performance guarantee, as noted earlier in this review, strikes me as disappointingly hard to interpret. The simulation results on synthetic problems are hard for me to assess: Appendix C spells out in greater detail the set of parameter values that were used in the simulations, but does not attempt to justify these choices of parameter values. The regret of MF-UCB seems to improve that of UCB by a factor of 3 to 4 in many of the simulations; should we expect this to be typical of the relative performance of the two algorithms in practice, or is it an artifact of the way parameters (e.g. values of \zeta^{(m)}) were chosen?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

The author provide a new framework called Multi-fidelity Multi-armed bandit. In this problem, arms can be sampled with different level of fidelity and a cost is associated to each one. While having a lower cost, low fidelity samples are drawn from a distribution approximating the highest fidelity distribution. They provide an algorithm adapting to the sequences of available approximations and cost, its regret analysis and a lower bound of the problem. They provide numerical simulation on synthetical problems where the proposed algorithm outperform UCB.

#### Qualitative Assessment

The proposed framework is interesting and if more developed can be impactful. However in this state, I found the paper unfit for publication. 1/ Introductions of the notations are quite confusing, they are very dispersed and makes their definitions hard to find. Per example, I did not find what is $s$ (after the rebuttal, it appears that $s$ is defined 1 page after its first use and is then barely used in the rest of the paper). 2/ The use of the advertising as example is misleading at some points. At many places, correlations with the display time of an ads and non stationarity are made. Non stationarity is out of context in this paper. It may be a good idea to use another example (see 4/). 3/ I find hard to justify the regret analysis in this setting. A regret analysis involves to maximize a gain in the long run and thus to find the arm with the best ratio reward/cost. Here the optimal policy is always the arm with the highest mean, whatever the associated cost is. Thus, the optimal policy is weakened by the obligation of sampling the best arm at lower fidelity level. This may hide the cost of sampling low levels when it is not needed. 4/ A sample complexity analysis (here more a cost complexity) may be useful in the context of Multi-fidelity Multi-armed bandits. It can be motivated by the testing phase of prototypes for the conception of new products. The goal is to find with high probability the best product while minimizing the cumulated cost of the test phase before mass production. As companies are bounded by regulations, contracts and existing facilities they are obligated to use low-fidelity approximations before accessing to advanced and costly tests. 5/ I also found some parts of the text unwieldy. It may be worth to make a pass on the paper to make it more streamlined.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

This paper aims a variant of the classical stochastic K-armed bandit problem where the cheap approximations to each arm is available. The authors formalise this problem as a multi-fidelity bandit problem. The theoretical results on regret bound are provided. Simulations are performed to demonstrate the effect of multi-fidelity.

#### Qualitative Assessment

The presentation is easy to understand. Simulations demonstrate the effect of MF bandit method. There are sufficient theoretical analysis on the lower bound of the proposed method, which shows the theoretical improvement.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 6

#### Summary

This paper studies a new multi-armed bandit setting. The setting is that there are $M$ fidelities at which the learner can choose to play an arm. Each successive fidelity provides a better approximation to the rewards but expends larger cost. If a strategy can use the lower and cheaper fidelities to eliminate several sub-optimal arms, while reserve higher and more expensive fidelities for a small subset of good arms, then it may outperform the standard UCB algorithm that neglects these characteristics of the fidelities. The setting could be applied to online advertising, where one may like to display ads for a few minutes to approximate the clicks in long run, for cost concern. For this problem, the authors consider a new regret definition, provide an algorithm with theoretical analysis (including regret lower bound), and conduct simulation. The analysis is novel.

#### Qualitative Assessment

As each \varepsilon^{(m)}, the deviation of the m_th fidelity mean from the one of the highest fidelity, is assumed to be known in advance, how do we know the values in practice (such as the application in online advertising)? Is it possible to relax the assumption? Minors: 1) line 326, 357, and 386: "... arm k \in H^{(m)}", 'H' should be 'K'. 2) line 343: For the last indicator term in the inequality, the argument "\Beta_{k^\ast,t} \leq s", 's' should be \mu_{\ast}. 3) line 345: \gamma ---> \gamma^{(m)} 4) line 346~347: "... and the last step uses \psi( \Delta_k^{(m)} - \gamma^{(m)} ) \leq \psi( \gamma^{(m)} ) when \psi( \Delta_k^{(m)} ) \leq 2 \gamma^{(m)}", the two '\leq's should be '\geq's. 5) line 384: For the inner summand on the left side of the first inequality, the summation over all k \in those "above" K^{(m)} should be over all k \in those "below" K^{(m)}. 6) line 407: On the right side of the last inequality, \delta^{(m)} should be \delta^{(M)}, and the last term \mu_{\ast} \kappa \lambda^{(m) } should be \kappa \lambda^{(m)}. 7) line 431: "for l \ge m" ---> "for l \ge p" 8) line 440: in eq(17), "l \geq m" ---> "l \geq p" 9) line 450: P(A_{n,3}) should be P(A_{n,2}).

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)