
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
(As requested by NIPS, this is a "light" review)
SUMMARY
The paper considers a stochastic bandit problem and analyzes a variant of minimax regret: instead of studying the regret against the worst set of distributions, the authors consider, for each arm k \in [K], the regret against the worst set of distributions under the constraint that this arm is optimal. This leads to a richer notion of minimax regret, seen as a vector in R^K.
The paper studies this notion of minimax regret and provides both lower bounds and upper bounds, expressed in terms of a Pareto regret frontier. The proof for the lower bound is specific and simple enough. The algorithm for the upper bound, called Unbalanced MOSS, is a generalization of the MOSS algorithm.
Some illustrative examples are provided. The authors also show that the additional factor K in the regret of MOSS is not just an artifact of the original proof and is observed in practice. They then suggest an Unbalanced UCB algorithm.
QUESTIONS AND COMMENTS
The paper restricts to rewards distributed according to a normal distribution. This raises some questions: 1) To what extent are your proofs married to this assumption? What cannot be easily extended? 2) What would be the main modifications for the traditional [0,1] bounded rewards? 3) What about simply dealing with a sub-Gaussian/light-tail assumption?
I believe Section 4 could be detailed more and enriched with further discussion. Page 8 contains many additional remarks that are, however, not discussed at all; they seem to have been assembled in a rush. Theorem 7, and especially the paragraph on adversarial bandits, should be discussed more, especially against the state of the art.
The figure on page 8 is hard to read when printed in B&W.
As it is, the paragraph on adversarial bandits looks misplaced/incoherent with the rest of the paper: I suggest either discussing it at length, maybe in a dedicated section, or skipping it.
You introduced/studied Unbalanced MOSS and Unbalanced UCB: what about Unbalanced KL-UCB or Unbalanced Thompson sampling? (It would make sense to study these at least empirically.)
Quality: Looks OK; I haven't checked in detail since this is a "light" review. Clarity: OK. The paper seems messy and should be polished/reorganized. More intuition/discussion should be provided. Originality: Good. Significance: The paper seems to bring an interesting/useful idea. The results should be discussed more against the state of the art, and the literature review is a bit weak.
* I have read the rebuttal and keep my score unchanged.
Q2: Please summarize your review in 1-2 sentences
The paper seems to bring an interesting/useful idea. The writing is a bit messy and the paper should be polished/reorganized. More intuition/discussion should be provided as well. The results should be discussed more against the state of the art, and the literature review is a bit weak.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: In a multi-armed bandit problem, define the worst-case regret of a strategy for a given action as the largest possible regret incurred when the given action is an optimal one. Further define the worst-case regret vector of a strategy as the vector of worst-case regrets for the different actions. The paper fully characterizes, up to constant factors, the set of achievable worst-case regret vectors in the stochastic setting. These regret vectors can be achieved by a variant of the MOSS algorithm of Audibert and Bubeck [2009]. The authors also show empirically that MOSS does not achieve near-optimal problem-dependent guarantees. Moreover, in the adversarial setting, they prove that a variant of the EXP strategy can achieve a certain family of worst-case regret vectors.
Comments:
Given that the problem considered in this paper is a natural and seemingly simple extension of the classic multi-armed bandit problem, it is a little surprising that it has not been solved before. The results in the paper are novel and important. I like the paper for three reasons.
First, it is important to quantify the cost of falsely believing that a certain action is optimal and, as a result, choosing that action more frequently than its past performance suggests. Second, while the KL-divergence argument is usually used to get a lower bound on the sum of regrets, in this paper it is applied in a clever way to lower bound the product of regrets. Finally, it is interesting to see that a simple but clever modification of the UCB and MOSS algorithms can match the lower bounds.
The paper is well organized and its contribution is clearly stated. However, the proofs in the paper contain many confusing errors and typos, although I think that they could be corrected without modifying the structure of the proofs.
Proof of Theorem 1:
- line 140: \epsilon_1 should be \epsilon_j. Otherwise, the second equality in line 157 would be wrong.
- line 157: one of the two T_k's should be T_1.
Proof of Theorem 7 in the appendix:
- line 485: the "-" should be a "+" and the "+" should be a "-".
- line 516: B_1 in the third term of the bound should be B_{i^*}.
- lines 516-518: the inequality \sum_{i=2}^{K} \frac{n}{B_i} \leq B_1 is used. However, if 1 is in the set A_1, then there would be an extra term \frac{2n}{B_1}\sqrt{\log(n)} that needs to be dealt with.
- line 518: the desired bound should be B_{i^*}\sqrt{\log(n)} instead of B_1\sqrt{\log(n)}.
To summarize, the paper solves an interesting problem by cleverly adapting existing methods and algorithms. It provides some valuable insights into the achievable worst-case regrets in the multi-armed bandit problem. However, the proofs in the paper contain some errors. Although I think that the results of the paper are still correct, the paper is not ready to be published unless the errors can be fixed.
After rebuttal:
The author rebuttal has answered my concern. I think that the paper studies a meaningful and important problem thoroughly by providing matching lower and upper bounds. I raise my score to 7.
Q2: Please summarize your review in 1-2 sentences
The paper solves an interesting problem by cleverly adapting existing methods and algorithms. It provides some valuable insights into the achievable worst-case regrets in the multi-armed bandit problem. However, the proofs in the paper contain some errors. Although I think that the results of the paper are still correct, the paper is not ready to be published unless the errors can be fixed.
After rebuttal: The author rebuttal has answered my concern. I raise my score to 7 and vote for acceptance.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the minimax bandit problem in the stochastic setting. The targeted case is to specifically bound the regret for a particular arm and characterise how this deteriorates the regret on the remaining arms. The authors provide new theoretical material to prove upper and lower bounds on the distribution of the per-arm regret that match up to constants. The upper bound is based on the proposed new algorithm Unbalanced MOSS, which is based on the MOSS algorithm. They show that, unlike in the expert case, guaranteeing a small regret for a particular arm comes at a greater cost (than in the expert case) for the remaining arms. Finally they design a new version of UCB (Unbalanced UCB) that, up to a logarithmic factor, is 'unfair minimax' optimal and which at the same time, contrary to the MOSS algorithm, possesses asymptotically optimal problem-dependent regret.
The paper addresses a problem of importance and seems for the most part of high value. My main concern is that I am not sure about the correctness of the lower bound proof. The questions raised there should definitely be addressed.
Some parts of the proofs are quite compact, which makes them hard to verify: passages (c) & (d) of the proof of Theorem 3. More intuition should be given to ease the reading of the proofs.
About the lower bound:
Equation in line 143: shouldn't it be an inequality instead of the last equality, as \epsilon_k is defined with respect to R_k, which is a sup over R_\mu?
In line 126, it is assumed that \mu \in [0,1]^K, but in the example given in the proof (the \mu_k) this seems not always to be the case. I think this is an issue because, as mentioned above about line 143, we need to use that R_k \geq R_{\mu_k}. You can see that the \mu_k do not belong to [0,1]^K first of all because there are negative numbers (-\epsilon_1). Second, isn't there an issue with the constant c? In the end the constant c is set to 4, but it seems that, in some cases, the constant should be smaller than 1 by construction. Indeed, we want \epsilon_k \leq 1 so that the \mu_k belong to [0,1]^K. So if the strategy \pi is such that the regret R_k on arm k is linear (R_k = n), as could happen if the strategy never pulls k, then we would have \epsilon_k = c. So c \leq 1. But this creates a problem with the end of the proof, where we need c > 2.
In passage (a) of line 150, why is there a T_k(n) appearing in the square root? It is not obvious from the mentioned standard entropy inequality.
Equation in line 157: in the first equality the \epsilon_k should be \epsilon_1 as defined in \mu_1. This seems to create a problem for the reasoning that follows in the same line. In the same line, in the last-but-one term, I guess it should already be R_k instead of R_{\mu_k}.
More generally, shouldn't the constant c depend on \pi? Because otherwise Equation 2 proves that every possible strategy pulls all the arms at least linearly often! (when c = 4, as used in the proof).
Other possible typos in the rest of the paper:
- line 269: in the second part, at the end of the line: where is it proven that \Delta_i \E[\tau_i] \leq \sqrt{n_i^*}? Is it useful? I see the part with n_i but not with n_i^*.
- line 289: a should be \alpha?
- line (d) of Corollary 6: shouldn't it be \alpha+1 and \alpha+1/2 in the denominator? Then the 'proof by graph' becomes even more fishy if there is a typo in the expression (though I guess it does not change much).
- line 305: (b) it should be \Delta_i \geq 4\sqrt{n_i^*}, right?
- line 311: \tilde{\Delta}_l is not defined for l = 0.
Q2: Please summarize your review in 1-2 sentences
The paper proposes new original theoretical techniques and results that address the minimax problem in stochastic bandits. The results are of importance and the work seems of high value. However I have an issue with the proof of the lower bound that needs to be resolved for a possible publication. I am willing to change my score upon reading the authors' rebuttal.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper deals with the problem of unfair bandits, where we would like to force different minimax regret values for the different arms. The motivation comes from cases where we have prior information that some arms are more likely to be optimal than other arms. This problem has previously been studied in the experts setting, where the player has access to all of the rewards. The authors provide a lower bound for the problem, along with a matching upper bound ("Unbalanced MOSS") for the stochastic case, thus giving a complete characterization, up to constant factors, of the attainable minimax bounds. The characterization is somewhat surprising and shows that the bandit case is much stricter than the expert case, in the sense that forcing a small regret on one arm comes with a heavy price w.r.t. the regret of the other arms. The authors further provide a modification of UCB that achieves a problem-dependent bound, and a modification of EXP3 for the adversarial case.
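The idea of forcing per-arm regret budgets can be illustrated with a toy UCB-style rule. The sketch below is hypothetical and is NOT the paper's Unbalanced UCB: the choice of scaling the exploration bonus by budgets[i]/max(budgets) is our own assumption, made only to show the qualitative trade-off (an arm granted a smaller budget gets a smaller bonus and is abandoned faster when it looks suboptimal).

```python
import math
import random

def unbalanced_ucb_sketch(arms, budgets, n, seed=0):
    """UCB-style index where arm i's exploration bonus is scaled by its
    regret budget B_i (hypothetical scaling, for illustration only)."""
    rng = random.Random(seed)
    K = len(arms)
    counts, sums = [0] * K, [0.0] * K
    b_max = max(budgets)
    for i in range(K):  # pull each arm once to initialise
        counts[i] = 1
        sums[i] = arms[i](rng)
    for t in range(K, n):
        def index(i):
            # Empirical mean plus a bonus shrunk by the per-arm budget.
            bonus = (budgets[i] / b_max) * math.sqrt(2.0 * math.log(t) / counts[i])
            return sums[i] / counts[i] + bonus
        i = max(range(K), key=index)
        counts[i] += 1
        sums[i] += arms[i](rng)
    return counts

# Two unit-variance Gaussian arms with means 0.5 and 0.0, equal budgets:
arms = [lambda r: r.gauss(0.5, 1.0), lambda r: r.gauss(0.0, 1.0)]
counts = unbalanced_ucb_sketch(arms, budgets=[1.0, 1.0], n=2000)
```

Shrinking one arm's budget makes its index collapse to its empirical mean more quickly; how costly that is for the other arms is exactly what the paper's Pareto frontier quantifies.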
The problem being dealt with is original and interesting, and definitely deserves an analysis. The techniques used are not exceptionally novel but are clean, elegant, and get the job done. Overall the paper is well written: the motivation is properly explained, the previous works are properly surveyed as far as I could verify, and the proofs are overall well explained, well structured, and have a nice flow to them.
There is one issue with the proofs that I was unable to verify and would like the authors to explain in their response, and fix in the camera-ready version, if accepted. The score I gave the paper is pending a satisfactory answer. Other than this hiccup I find it to be a solid paper. On page 3, line 150, the inequality marked by (a): this inequality is non-trivial and the main theorem is based on it. It deserves a much more detailed explanation than the one given in the footnote. In particular, please state exactly how P and Q are defined (are they the distributions associated with \mu_1 and \mu_k?). Also, state how X is defined and how it relates to P and Q. Finally and most importantly, please provide a formal statement of the fact being used, along with a proper citation.
Minor comments / typos:
- Page 2, definition of {\cal B}: all B_i's should be positive.
- Page 2, line 78: mention that B_K is the largest value of B.
- Page 3, line 140: should the "otherwise" option be \eps_j rather than \eps_1?
- Page 4, line 174: B_1 \geq \sum ... => B_1 \geq (1/8) \sum ...
- Page 4, just before Lemma 3: a suggestion: I would add a one-liner explaining what Lemmas 3 and 4 are about: Lemma 3 lower bounds the estimation of \mu_1 and Lemma 4 establishes how soon we have a proper upper bound on each \mu_i.
- Page 5, line 222: 2^{k} \Delta => 2^{k+1} \Delta.
- Page 5, line 256: should be: \forall s' \leq s, \hat{\mu}_{i,s'} - \mu_{i,s'} <= ...
- Page 6, line 305, equation after (b): 2/\Delta => 4/\Delta.
- Page 6, line 310: the verbal description of A is wrong. It contains the arms that are not negligibly suboptimal.
Q2: Please summarize your review in 1-2 sentences
The paper deals with the problem of unfair bandits, where we would like to force different minimax regret values for the different arms. The problem is motivated and has been studied in the expert case. The paper provides a thorough and interesting analysis of the problem, giving lower bounds and an algorithm for a matching upper bound. It has some issues but is overall well written, and interesting.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
All Rev:
Thank you to all reviewers for carefully reading this submission.
All reviewers noted typos and minor errors; we are sorry for inflicting those upon you, and thankful for your perseverance. Fortunately all the errors are relatively minor and are solved either by fixing the typos or by more carefully stating the required assumptions.
We feel this paper can be made acceptable without much difficulty and hope that it can get over the bar for NIPS. Besides polishing, our plan for improvement is:
1. Fix minor errors/typos, correct the boundedness assumption, and restate the upper bound.
2. Add intuition and extend the literature review. This will come at the cost of pushing some lemmas to the appendix (in hindsight this is worthwhile).
See specific responses below.
Rev. 1:
For Thm 1 you are right: epsilon_1 should be epsilon_j. The other point is right too.
For Thm 7, your second-to-last comment seems not to be a problem, since i^* is never in A_1 or A_2, but we should point this out.
Rev. 2:
>...line 143: Shouldn't it be an inequality
Yes.
>...assumed that \mu \in [0,1]^K
Yes, this is an issue. The simplest solution is to drop the assumption of bounded rewards and restate the upper bound. This makes no difference in the interesting regime where R_k^n / n <= 1 for all k.
>...T_k(n) appearing in the square root?
See response to reviewer 3.
>...\epsilon_k should be \epsilon_1 as defined in \mu_1
There is a typo in the def. of mu_k, where eps_1 should be replaced by eps_j.
>...R_k instead of R_\mu_k
Yes.
>...shouldn't the constant c depend on \pi?
No. (2) is about the number of pulls of arm k in environment k.
>line 269: In the second part...
Should be dropped.
>line 289: a should be \alpha?
Yes.
>Shouldn't it be \alpha+1
Yes. The upper bound becomes 8/\sqrt{e} by simple analysis.
>should be \Delta_i \geq 4*sqrt(n_i^*)
Yes.
Rev. 3:
We will incorporate your suggestions.
Re. line 150: X = T_k(n) satisfies X <= n. P is a measure on the space of outcomes given environment \mu_1 and Q is a measure on the space of outcomes given environment \mu_k (both depend on the choices of the fixed policy). Assume WLOG that k = 2:
\mu_1 = (0, -eps_2, -eps_3, ..., -eps_K)
\mu_2 = (0, eps_2, -eps_3, ..., -eps_K)
The two differ in the 2nd term. Define P_t and Q_t as the conditional measures on the reward at time step t given the history, including the arm chosen at time step t.
By the chain rule: KL(P,Q) = sum_{t=1}^n E_P[KL(P_t, Q_t)].
If the arm at time step t is not 2, then KL(P_t, Q_t) = 0. Else it is the divergence between two Gaussians with variance 1 and means differing by 2eps_2, which is 2eps_2^2.
Therefore E_P[KL(P_t,Q_t)] = E_P[indicator(I_t=2) * 2eps_2^2] and so sum_{t=1}^n E_P[KL(P_t,Q_t)] = 2eps_2^2 * E_P[T_2(n)].
We will add this and reference [1], where it is described in detail.
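The Gaussian divergence step above is easy to sanity-check: for unit-variance Gaussians, KL(N(a,1) || N(b,1)) = (a-b)^2/2, so two means differing by 2*eps give 2*eps^2 per pull of arm 2. A minimal check (the ±eps placement of the two means is our reading of the construction):

```python
def kl_unit_gaussians(a, b):
    """KL(N(a,1) || N(b,1)) = (a - b)^2 / 2 for unit-variance Gaussians."""
    return (a - b) ** 2 / 2.0

eps = 0.3
per_round = kl_unit_gaussians(-eps, eps)    # means differ by 2*eps
assert abs(per_round - 2 * eps ** 2) < 1e-12  # (2*eps)^2 / 2 == 2*eps^2
# By the chain rule, the total divergence is this per-round value times the
# expected number of pulls of arm 2: KL(P, Q) = 2*eps^2 * E_P[T_2(n)].
```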
Rev. 4:
We will proofread the paper for other such ambiguities.
Rev. 5:
We know the nice paper by Bubeck & Liu, but don't see the relevance. They bound the Bayesian regret in a prior-free way and otherwise focus on the BPR setting, where specific priors are crafted for good frequentist regret with Thompson sampling. They do not analyse the general case or the effects of the choice of prior. These results have no implications for the trade-offs available to Bayesian algorithms.
There is an arXiv paper by Liu & Li (unavailable at submission time). They study a similar problem with two arms and focus on Thompson sampling. We will include a reference to / discussion of this paper.
We agree a Bayesian approach is natural, but one needs to understand the effects of the prior on the regret. We characterise what is possible for *any* algorithm.
Problem-dependent regret is important, but we find the logarithmic regime less interesting here because it does not match the motivation to design algorithms that come with solid guarantees on the regret regardless of the specific problem. The Liu & Li paper also focusses on the worst case.
We extended MOSS to obtain matching bounds. Generalising other variants of UCB is quite easy.
See the response to reviewer 2 for the issue with bounded means.
We hope to see an increase in your score.
Rev. 6:
>...married with this assumption?
This is for the lower bound only, which holds for most natural noise models, like Bernoulli.
>...modifications for the traditional [0,1]...
For the lower bound you need to compute and bound the KL divergence for Bernoulli rewards, which is done in [1] and would not change the nature of the bounds.
>...sub-Gaussian...
The lower bound doesn't hold for distributions much more concentrated than a Gaussian. The upper bounds hold essentially unchanged.
The proof for Unbalanced UCB is general and can be carried through for any algorithm with suitable bounds on E[T_k(n)], including KL-UCB/UCB-V and possibly Thompson sampling.
[1] Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem.

