Submitted by
Assigned_Reviewer_3
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper studies the power of adaptive adversaries in
full information online learning and bandit problems. More specifically,
adversaries with switching cost, with memory of size 1 and with bounded
memory are considered. Several matching lower bounds on the policy regret,
which is a more suitable notion of regret for adaptive adversaries, are
presented.
The main result is a new lower bound for bandit
problems with switching costs that matches the existing O(T^{2/3}) upper
bounds. This implies similar lower bounds for bandit problems with bounded
memory adversaries. From this lower bound, they obtain an \Omega(T^{2/3}) lower
bound for full information problems with bounded memory adversaries that
matches the existing upper bounds. The lower bound proof and the reduction
from the full information to the bandit problem are interesting. Finally,
when losses are stochastic and iid, the authors prove an O(\sqrt{T log log log
T}) regret bound for the bandit problems with switching costs with only
O(log log T) switches, which is somewhat surprising.
Unfortunately, these lower bounds are proven for a larger class of
problems with "bounded range" and "bounded drift". Thus, the lower bounds
do not apply to the standard (and smaller) class of bounded loss
functions.
The paper is well-written and most proofs seem to be
correct.
Q2: Please summarize your review in 1-2
sentences
The paper studies the power of adaptive adversaries in
full information online learning and bandit problems. It is well-written
and a nice contribution to NIPS.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
The main contributions of the paper include providing a
lower bound of T^{2/3} for the problem of online learning with switching
costs and bandit feedback. This matches the upper bound known for this
problem. It is a somewhat surprising result, as it is known that the full
information version of this problem enjoys a sqrt{T} regret bound. Further,
through a novel reduction of the problem of online learning with bounded
memory adversaries (under policy regret) with full information to the
problem of online learning with switching costs with bandit feedback, the
authors show a T^{2/3} lower bound for learning against bounded memory
adversaries even in the full information case. It is also shown that when
faced with an iid adversary, one can still achieve an O~(sqrt{T})
regret bound with only O(log log T) switches for the switching costs
bandit problem.
My main concern about the result is the O(log t)
drift allowed in the lower bound. This is a bit unnatural. Further, looking
at where this comes from in the proof, it seems like the statement might be
true even with constant drift. Is it perhaps possible to get a worse lower
bound of T^{2/3}/log T while keeping the drift constant? Another direction
could be to try to take the Xi's to be Rademacher random variables, but I wasn't
able to track what happens to Lemma 1.
Another way to perhaps
alleviate this issue might be to allow the adversary to be random, with the
restriction that the expected drift is bounded by a constant, which is true in
your case.
Nonetheless, I find the results compelling even with
the quirk of the log t order drift being allowed.
Q2: Please
summarize your review in 1-2 sentences
The paper is well-written. The results are very
interesting (and even surprising) and of definite interest to the
theoretical machine learning community. I recommend the paper for
publication.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
This paper studies the problem of prediction with
expert advice and finds matching upper and lower bounds on regret, of
order T^{2/3} in terms of T, in a number of settings: bandit feedback with
three kinds of adversaries (bounded memory, size-1 memory, and imposing
switching costs) and full-information feedback with adversaries with
bounded memory.
This is a fundamental question, and it's nice to
see progress here.
There are a couple of unusual things to note
about the setting:
- The definition of expected regret is not the
usual one. Traditionally, one looks at the expected difference between the
costs of the selected actions and the performance of the best fixed action
in hindsight (meaning given the actual history of play). Instead, they
look at the expected value of the difference between the actual
performance and that of the best fixed action (not taking the actual
history into account, but rather assuming that the fixed action had been
played all along). This difference is salient because they consider
adaptive adversaries who, e.g., take into account a limited history when
selecting their cost functions. The regret definition used here ("policy
regret") seems to be the right one for these settings, and the paper
mostly does a good job of making this distinction. But since this is a
potential source of confusion, I would have preferred if the paper had
been more explicit with references to the cases where similar results are
known under the standard expected regret. Also, it's strange that Table 1
summarizing results doesn't provide standard references for the results in
prior work.
- The results require a weakening of the usual assumption
of bounded loss values (and hence the reproving of the corresponding upper
bounds, which the paper does do). I still think the results are
interesting, but of course it would be nicer to have results in the
standard model.
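For concreteness, the two regret notions contrasted above can be written out as below. This is only a sketch in assumed notation, not necessarily the paper's: x_1, ..., x_T are the player's actions, f_t is the loss function chosen by the adversary at round t (which may depend on the whole history of actions), and \mathcal{A} is the action set.

```latex
% Standard regret (one common form): the comparator x replaces only the
% round-t action; the history x_1, ..., x_{t-1} is the one actually played.
R_{\mathrm{std}}(T) = \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_1,\dots,x_t)
    - \min_{x \in \mathcal{A}} \sum_{t=1}^{T} f_t(x_1,\dots,x_{t-1},x) \Big]

% Policy regret: the comparator x is assumed to have been played in
% every round, so the adversary's reaction to the history changes too.
R_{\mathrm{pol}}(T) = \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_1,\dots,x_t)
    - \min_{x \in \mathcal{A}} \sum_{t=1}^{T} f_t(x,\dots,x) \Big]
```

When the adversary is oblivious, so that f_t depends only on the action played at round t, the two expressions coincide; they can differ only for adaptive adversaries.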
The paper gives a thorough treatment, with nice
proof sketches of the main results in the main body of the paper, and a
nice discussion of open questions.
Q2: Please
summarize your review in 1-2 sentences
This paper makes progress on interesting fundamental
questions, and is presented well.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their comments and
suggestions.
Assigned_Reviewer_5: Thanks for the suggestions
regarding constant drift. We agree that it would be more aesthetic if the
drift were constant, but we currently don’t know how to make it work. The
difficulty is to show that the player doesn’t get any information by
staying on the same action, and only gets a little bit of information by
switching actions. We know that replacing the Gaussian random walk with a
Rademacher random walk doesn’t work, but we will definitely keep thinking
about the other suggestions.
Assigned_Reviewer_6: We will try to
make the distinction between standard regret and policy regret as explicit
as possible throughout the paper. We will add references to prior work in
Table 1 where appropriate.
