
Submitted by
Assigned_Reviewer_2
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors apply the Optimistic Mirror Descent
algorithm to a variety of problems where smoothness induces a certain
amount of predictability in the resulting loss function sequences. In
particular, better rates are possible when either: 1) the objective G is
fixed for all t, and smooth, or 2) the problem is a saddle point problem
on a function phi(f, x) which is appropriately smooth for each of its
arguments individually, and the optimistic mirror descent algorithm is
played against itself. These are rather specialized notions of
predictability, but they lead to some interesting results. Section 4
extends this to zero-sum games, and Section 5 considers approximate smooth
convex programming.
Unlike many applications of online algorithms,
the goal in most cases in this paper is to solve a batch optimization
problem. This should probably be made more explicit in the introduction.
Section 3: The authors need to be clearer about why this
section improves on the result from Lemma 3. The key point (I think) is
that the smoothness assumptions of Cor 5 may be satisfied even if G(f) =
inf_x phi(f, x) is not smooth. Note that an alternative approach to
solving the problem of Sec 3 is to run an online algorithm against a
best-response adversary. It is worth asking whether any online
algorithm paired with a best-response oracle can lead to convergence
rates like 1/T produced by the procedure introduced here. (Perhaps not,
since this would appear to destroy predictability).
Section 4: Be
clearer about how this section improves on Section 3. The key seems to be
adding robustness to an arbitrary adversary in addition to the good bounds
for the predictable one. The results of Sec 4.2 are rather weaker than the
section title suggests. I would suggest describing this setting as
multi-point bandit feedback, and citing “Optimal Algorithms for Online
Convex Optimization with Multi-Point Bandit Feedback”. Based on the
notational similarities, it seems quite likely the authors are aware of
this work. It really should have been cited in the submission. It also
seems somewhat surprising that 4 points are needed, rather than just 2;
this should be discussed.
Section 5: It’s not clear to me that one
can assume F^* is known WLOG. At the very least, the binary search
procedure needs to be made precise: Are we running the algorithm of
section 5 at each step of the binary search? If so, how do you check
whether the guess of F^* is too high or too low? How does using a guess of
F^* that is off by some amount delta impact the final bound of Lemma 8?
This carries over to Sec 5.1 as well of course.
Also, Eq. (11)
appears to fit into the setup of Sec 3 (Cor 5); why do we need the
machinery of Sec 4.2 here (Lemma 7)? Is Line 368 a typo?
In
general, the paper suffers from some typos that made it more difficult to
review, and could also probably use a bit more discussion of related work
(several particular citations are recommended below).
Local
points:
- Line 112: This simple intuition and analysis for Mirror
Prox is very nice.
- Line 93: This bound can be seen as a direct
generalization of the result that the “Be The Leader” (BTL) algorithm
suffers no regret, since you arrive at a BTL algorithm by taking Delta_t =
M_t. This is worth mentioning, as well as a citation to “Efficient
algorithms for online decision problems” (Adam Kalai and Santosh Vempala).
- Line 131: I would find the section title “Saddle-Point
Optimization” more clear.
- Lines 154-155: citing “Adaptive game
playing using multiplicative weights” (Yoav Freund and Robert E. Schapire)
is probably appropriate. This citation is also appropriate at lines
208-209.
- Line 223: More detail needed; explain exactly why Sec 3
doesn’t lead to this result (because it doesn’t handle an arbitrary
adversary as well?), as well as why a direct application of mirror-prox
does not provide strong decoupling.
- Line 225, also lines 265 and
268: It’s imprecise to refer to regret bounds like 1/T or 1/sqrt(T); you
need to multiply through by T to get regret as defined on line 65.
Alternatively, refer to these as convergence rates.
- Lines
312-314: Typos in both “Build estimates” sections of the pseudocode: you
want r^+ - r^-; as written, all the a’s and b’s are zero.
- Line
366: Epsilon hasn’t been defined yet; introduce it as an arbitrary constant eps
> 0 earlier in the statement.
- Line 368: Do you mean Corollary
5, not Lemma 7? If Lemma 7 is really intended, then the procedure needs to
be made clearer, e.g., what is the matrix A here?
- Appendix:
The proofs make extensive use of copy-paste, which makes them much harder
to read because there is so much repetition. Please restructure the
proofs so they are less verbose and the main arguments are clearer.
Q2: Please summarize your review in 1-2
sentences
The paper analyzes some new applications of Optimistic
Mirror Descent, in particular, a simple analysis and intuition for fixed
smooth functions, and an analysis applicable to self-play of the
algorithm. The results seem novel and interesting.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Optimization, Learning, and Games with Predictable
Sequences
Summary: The paper discusses Optimistic Mirror
Descent (OMD). This is a variant of the classical algorithm for online
convex optimisation that is especially good when the sequence of loss
function gradients exhibits some predictable regularity that can be
exploited by the learner.
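A minimal Euclidean-prox sketch of the optimistic update may help fix ideas (the step size, ball constraint, and hint function below are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def optimistic_md(grad, predict, x0, eta=0.1, T=100, radius=1.0):
    """Optimistic Mirror Descent with a Euclidean prox (projected-gradient form).

    grad(f, t)  -> gradient of the round-t loss, evaluated at f
    predict(t)  -> hint M_t for the upcoming gradient (e.g. the previous one)
    Both half-steps project back onto the ball of the given radius.
    """
    def proj(v):
        # Euclidean projection onto the radius-ball
        n = np.linalg.norm(v)
        return v if n <= radius else v * (radius / n)

    g = np.array(x0, dtype=float)   # secondary iterate g_t
    avg = np.zeros_like(g)
    for t in range(1, T + 1):
        f = proj(g - eta * predict(t))   # optimistic half-step using the hint M_t
        g = proj(g - eta * grad(f, t))   # correction half-step using the true gradient
        avg += f
    return avg / T                       # averaged primary iterate
```

With M_t taken to be the previously observed gradient, this recovers the Mirror Prox-style substitution the summary mentions.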
The authors first give the specific
substitution for which OMD reduces to Mirror Prox, and generalise the
known bounds to Hölder-smooth functions. The paper then proceeds with the
analysis of objectives with saddle-point structure. It first discusses
plain min-max optimisation by a single algorithm. It then turns to a
game-theoretic viewpoint, and discusses how two opponents can arrive at a
saddle point faster when both players use the proposed algorithm, while
maintaining the worst-case sqrt(T) regret without cooperation. The paper
concludes by considering approximate convex optimisation with an
application to approximate max flow.
Discussion:
This
paper on prediction of predictable sequences is well executed and rich.
The exposition is very clear, and the results are strong. I very much like
the application to saddle-point finding, where the players cannot assume
cooperation but there is a speedup if it happens.
Small
things:
- Page 5, Proposition 6: in your notation you access x_0 and f_0, but these are undefined.
- Page 7, Eq. (11): here x[i] should be x_i.
Q2: Please summarize your review in 1-2
sentences
This beautiful paper investigates prediction with
predictable sequences (of gradients) and obtains strong results with
interesting technical applications.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper provides applications of online learning
with predictable loss sequences. First, earlier results are generalized to
Hölder-smooth loss functions. The framework is then applied to compute
solutions to saddle-point-type problems, as well as to efficient
computation of equilibria in two-player zero-sum games (for cooperative
players, with guarantees if one of the players does not cooperate).
Finally, convex programming problems are considered, but here I find the
underlying assumptions strange.
I think the problems considered
are of natural interest, and the approach is novel. It is somewhat
misleading, though, that the "predictable sequence framework" uses past
observations as the predictions, and smoothness to quantify how good
those predictions are.
The paper is generally
well-written and clear. However, I have two major criticisms which need to
be addressed in the authors' response:
1. All learning rates in
the paper are set by peeking into the future: \eta_t is a function of
\nabla_t, which only becomes available after the update with \eta_t is
performed and the next gradient (loss) \nabla_t is revealed.
2.
Section 5: I do not understand the assumption that F^* can be known in
advance, or can be found using binary search. Why would verifying the
existence of a solution with a given value be any easier than actually
finding such a solution? So why is it a valid assumption to know the exact
(not approximate) value of the maximum without actually knowing an optimal
solution? (In which case the whole following optimization process is
pointless.)
Minor comments:
1. It does not seem to be
necessary to give the proof of Lemma 1, copied verbatim from [9], except
for the unproven first line in the statement. Instead, Lemma 3 of [9]
should be referred to.
2. Proof of Lemma 2, line 571: it seems
that 1/eta_1 \le 1/R_max is implicitly assumed, which is the same as
assuming the norm of the gradient is bounded by 1. Please make this
explicit (or avoid this bound).
3. Line 11: It seems that \rho
should be 1/\sqrt{H} to cancel the first term in (3), making the
bound in line 113 worse.
4. Explain the inequalities in the
derivations, such as lines 587, 590, 637, 645, 655, 659, etc. These are
not necessarily hard, but sometimes the inequalities are so long
that it is hard to find the differences.
5. Corollary 5:
define the norms ||·||_F and ||·||_X.
6. Proposition 6: equation (9) bounds
the average (or normalized) regret. The same holds for Lemma 7.
The
authors might also be interested in the recent ICML paper Dynamical Models
and tracking regret in online convex programming by Eric Hall and Rebecca
Willett (JMLR W&CP 28(1):579-587, 2013), which considers
predictability in a related, albeit somewhat different setting.
ADDED AFTER THE REBUTTAL PERIOD:
- learning rates: the corrections seem OK.
- Section 5: knowing F^* in advance: as I
understand the response, F^* can only be known approximately, and this
approximation has some price in the bounds. So I would not say knowing F^*
can be assumed without loss of generality. However, the explanation given
would be sufficient.
- bound in line 113: you are right, it was my
mistake.
Q2: Please summarize your review in 1-2
sentences
I think this is a solid paper with several interesting
results.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for the detailed comments. Let
us address the particular points raised in the reviews:
Reviewer
1:
** Section 3: The authors need to be clearer about why this
section improves on the result from Lemma 3.
As the reviewer
correctly points out, the key observation in Section 3 is that faster
rates may be obtained even for nonsmooth functions G that can be
represented in the form G = inf_x phi(f,x) with smooth \phi. A standard
example is the l1-norm function. We will make this point clearer.
** Section 4: Be clearer about how this section improves on
Section 3.
Indeed, the key to Section 4 is ``robustness'' with
respect to adversarial sequences vs benign ones. This robustness is
achieved through an adaptive step size. The reason we need 4 queries is that
2 are used for the predictable sequence and 2 for the current loss. We do
not know whether this is necessary.
** Section 5: It’s not clear
to me that one can assume F^* is known WLOG.
Two reviewers asked
about this, so let us clarify. Irrespective of what value of F^* we choose,
the solution returned by the algorithm will have objective value
(1 - epsilon/gamma) F^*. However, if the problem with value
(1 - epsilon/gamma) F^* is infeasible (that is, this value is larger than OPT), the
solution will fail at least one of the constraints. Thus, to check whether
an objective value is larger than OPT or not, we simply run the proposed
algorithm for the prescribed number of rounds T (which only depends on
epsilon and not F*) and at the end of the T rounds check the constraints.
This way we may perform binary search. Finally one obtains an approximate
solution by paying only an extra log(1/epsilon) factor. This is a standard
step (see e.g. [5]) and we will make this explanation more explicit in the
text. Further, and importantly, making gradient steps and projecting onto F
is cheaper than solving the original problem.
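The wrapper described above can be sketched as follows (a hypothetical sketch: `solve` and `feasible` stand in for the Section 5 procedure and the constraint check, which are not spelled out here):

```python
def binary_search_Fstar(solve, feasible, lo, hi, eps):
    """Binary search for F^* as sketched in the rebuttal.

    solve(F_guess) -> candidate solution from running the procedure for T
                      rounds with objective target F_guess (T depends only
                      on eps, not on F_guess)
    feasible(f)    -> True iff f satisfies all constraints
    Halves [lo, hi] until its width is below eps, so the overall cost
    grows only by a log(1/eps) factor.
    """
    best = None
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        f = solve(mid)
        if feasible(f):      # guess not above OPT: push the target up
            best, lo = f, mid
        else:                # guess exceeds OPT: returned point breaks a constraint
            hi = mid
    return best, lo
```

On a toy scalar instance with OPT = 0.7, the returned lower endpoint approaches OPT to within eps.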
** Eq. (11) appears
to fit into the setup of Sec 3 (Cor 5); why do we need the machinery of
Sec 4.2 here (Lemma 7)? Is Line 368 a typo?
Yes, we thank the
reviewers for pointing this out, this is a typo. We meant to refer to
Lemma 4.
** Line 223: More detail needed, explain exactly why Sec
3 doesn’t lead to this result (because it doesn’t handle an arbitrary
adversary as well?), as well as why a direct application of mirror-prox
does not provide strong decoupling.
Note that the predictable
sequence in mirror prox is not something that player 1 observes: it is
some internal knowledge known only to player 2. On the other hand, the
strategy we propose uses only the observed payoff vector from round t-1 as
the predictable sequence. A second justification is that a variable step size
is needed for adaptivity to any adversary while retaining the fast rate
against an honest protocol.
Reviewer 3:
** All learning
rates in the paper are set by peeking into the future: \eta_t is a
function of \nabla_t, which only becomes available after the update with
\eta_t is performed and the next gradient (loss) \nabla_t is revealed.
We thank the reviewer for noticing this! The result is still
correct, but eta_t should be replaced by eta_{t-1}, as is currently
written in the bandit version. The proof needs only the following minor
edits:
- When both players are honest, exactly the same proof
works, since the bound for eta_t^{-1} used there holds for eta_{t-1}^{-1} as
well.
- When the opponent is adversarial, we get an extra telescoping sum
after applying Hölder with eta_t. That is, only the last 2 lines of the proof
change, and we get instead:

sum_t (eta_t^{-1} - eta_{t-1}^{-1}) ||f_t - g'_t||^2 \leq R^2 sum_t
(eta_t^{-1} - eta_{t-1}^{-1}) \leq R^2 (eta_T^{-1} - eta_1^{-1})
Hence, overall we get an extra factor of
2 in the bound. A similar rectification needs to be made for Corollary 2 as
well. These changes will be reflected in the final version.
**
Section 5: I do not understand the assumption that F^* can be known in
advance
Please see the response to reviewer 1.
** line 11:
It seems that \rho should be 1/\sqrt{H} to cancel the first term in (3),
making the bound in line 113 worse.
We believe the
statement is correct as written. Notice that H becomes squared, which was
perhaps missed by the reviewer.
 