
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes and analyzes a method for learning
in robust MDPs. While this setting is very similar to learning in
stochastic games, the main difference is that in robust games, the optimal
move of the opponent is observed, while in robust MDPs the decision maker
only observes the outcome (the opponent chooses the probabilities).
The paper makes a small advance on a relevant, nontrivial, and
interesting topic, but I am not sure that it is quite ready for
publication in its current form.
First, the setting is somewhat
contrived and not motivated. A natural setting would be simply to use
reinforcement learning to learn to act in a robust setting. Instead, the
authors assume that the robust uncertainty sets are known in advance,
while the stochastic probabilities are not known. This choice is justified
neither by applications nor by the difficulty of the natural
case. The analysis is not very novel either, with Lemma 2 being the most
significant difference with prior work. Addressing the natural setting
would make this a much stronger paper.
There are several minor
issues too.
The behavior of the adversary is not well-defined
during the execution of the algorithm. As defined on page 4, the adversary
should take the minimum-return transition probabilities for the current
policy. Therefore, the transition probabilities for the adversarial states
should only change after a policy change. The algorithm does not take
advantage of this fact and seems to assume that the adversary can simply
behave arbitrarily – or perhaps behaves in order to maximize the regret.
It would be good to clarify this issue.
I find the distinction
between adversarial and stochastic states artificial and unnecessary.
Stochastic states are simply a special case of adversarial states with a
single distribution in the uncertainty set. In addition, adversarial
states may behave as stochastic if the worst-case distribution chosen by
the opponent is always the same; it is impossible to tell the two apart in
that case.
I did not find the experiment useful – it is not
reproducible, not well motivated, and does not really show anything
interesting. It may be better to remove it and replace it by some of the
proofs that are currently in the appendix.
The proofs are correct
– if a bit tedious – as far as I can tell. The presentation could be
improved in emphasizing the main difference in the proofs in comparison
with the standard stochastic MDP results. Q2: Please
summarize your review in 12 sentences
The paper makes a small advance on a relevant,
nontrivial, and interesting topic, but I am not sure that it is quite
ready for publication in its current form. The setting is contrived, the
methods used are quite standard and very similar to ordinary analysis for
stochastic MDPs. Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
Additional comments after author feedback:
I
thank the authors for answering my questions. I now understand how the
proposed algorithm works. I do not have any concerns anymore regarding the
stochastic check. I see how it may be eventually triggered.
================================================ Original
review:
The paper describes a new reinforcement learning technique
that can handle adversarial scenarios. This is a very important problem
that has not been investigated very much. The proposed algorithm is to my
knowledge the first that can learn whether each parameter is adversarial
or stochastic and infer a robust policy based on that.
The paper
is generally well written however, some important aspects of the proposed
algorithm are not explained. I also have some concerns about the
stochastic check.
In Figure 3, why is Q the minimum of T-t and a
usual backup? If the rewards are always greater than 1, then Q will always
be T-t and the algorithm won't be learning anything... I suspect that
there is a simple explanation, but I'm not sure...
Also, in Figure
3, how is \hat{P} estimated? I couldn't find any description of how the
transition distributions are estimated. Is it simply the relative
frequency counts based on the trajectory observed?
I'm also having
trouble convincing myself that the stochastic check is reasonable. It says
that if the difference between the expected future rewards at (s,a) when
computed in two different ways is greater than some threshold, then
Pr(s'|s,a) must be adversarial. The threshold is an expression that
depends on n and tau, which means that it grows over time. In fact, at
some point it will be greater than the difference between Vmax and Vmin
where Vmax = T*Rmax and Vmin = T*Rmin. This means that if the agent has
not detected an adversarial behaviour by this number of steps, it will
never find which parameters are really adversarial. This doesn't seem
right. What will happen if the adversary behaves stochastically long
enough to ensure that it is not detected and then reveals its adversarial
nature after that? Would it manage to inflict arbitrary regret?
The empirical evaluation consists of a proof of concept since it
includes only a single toy problem that was handcrafted to demonstrate
that UCRL2 will fail to converge to a good policy. I encourage the authors
to include a more thorough empirical evaluation with several problems, a
variation in the degree of adversarial behaviours and to report the
running times.
Q2: Please summarize your review in 1-2 sentences
The paper tackles an important problem: how to learn
which parameters are adversarial vs stochastic in RL. The work is well
motivated and interesting. Although the empirical evaluation consists of a
proof of concept, the paper makes an important theoretical contribution.
This is also the first work (to my knowledge) that learns to distinguish
adversarial vs stochastic parameters in RL. Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
In this paper, the authors consider a
reinforcement learning (RL) problem where transitions from some
state-action pairs can be governed by an adversary. They propose provably
efficient RL algorithms for this context. The proposed RL algorithms are
variants of the UCRL2 algorithm of Jaksch et al. 2010, with an additional
"stochasticity check" to detect adversarial state-action pairs. The
authors also prove that the proposed algorithms achieve sublinear
cumulative regret relative to a reasonable minimax policy. I believe the
results are novel and significant for two reasons. The problem addressed
is distinct from those addressed in the literature, as far as I know. The
paper is well-written and enjoyable to read, and the analysis seems to be
rigorous and technically correct. Perhaps one weakness is that the
proposed algorithms require knowledge of $U(s,a)$ (the action space of the
adversary). It may be unrealistic to assume that the algorithm knows
$U(s,a)$'s precisely at the beginning. The authors should also provide
more information about the experiment (maybe in the appendix). It is not
clear to me whether or not the current experimental results are
representative of what one should expect to see more broadly.
Some
technical comments:
o I think the authors implicitly assume that
$r(s,a) \in [0,1]$ (as in the UCRL2 paper, we need it to ensure that $T-t$
is an upper bound on the Q-function in Figure 3); however, it seems that the
authors have not explicitly mentioned it in this paper.
o In line 169,
the definition of $V_t^{\pi}(s)$, the minimization and expectation should
be over $P_t$ to $P_{T-2}$. Specifically, the states and actions from step
$t$ to $T-1$ do not depend on $P_{T-1}$. $P_{T-1}$ influences the
transition to $s_T$, which does not show up in the RHS of the
equation.
Q2: Please summarize your review in 1-2
sentences
This paper considers learning in robust MDPs and
proposes provably efficient algorithms for it. The results seem to be
novel and significant.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for providing very helpful
comments. We address the comments to each reviewer below.
1.
Assigned_Reviewer_6 suggests a "natural setting" where one learns to act
in a robust setting without knowing even the uncertainty sets. We note
that this is one of our ultimate goals in learning robust MDPs. We believe
that with the current results, we contribute a significant first step
towards this goal.
2. The following clarifies some other
issues raised by Assigned_Reviewer_6:
(a) "The adversary only
chooses the worst-case transition". While the "value" of a state is
defined based on the worst-case choice by the adversary, the adversary is
free to choose other transitions (it can even be history-dependent). In
fact, the algorithm takes advantage of adversaries that do not always
choose the worst-case transition.
(b) "The distinction between
stochastic and adversarial states is artificial".  While a
pure-stochastic state is a special case of general adversarial ones, the
shift to more general adversarial behavior involves significant
qualitative change. For example, the Markov property may no longer hold.
We show an example of this in Figure 1. Indeed a similar setup that aims
to handle an unknown system that can be either adversarial or stochastic
has been studied in the bandit setting in Bubeck et al 2012.
(c)
"Experiment not reproducible, not well-motivated". The example we
show is a scenario in robust MDPs where a system deviates from a typical
(well-modeled) behavior (e.g., an unmodeled system failure). The results
can be easily reproduced over a wide range of parameters and we only use
standard versions of the algorithms. We will include extra details in the
appendix to enable straightforward reproduction of the exact quantitative
results. We would like to remark that the submission is mostly a theoretical
one and the simulation is mainly a proof of concept.
3.
The following answers some questions from Assigned_Reviewer_8:
(a)
The range of Q. Indeed we assume that the reward of each step is
bounded in [0,1]. Therefore the value of Q_t is at most (T-t) since there
are only (T-t) steps left. We will make this explicit.
(b) \hat{P}
is defined in lines 234-237, and yes, it is the empirical next-state
distribution based on past transitions.
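For concreteness, a minimal sketch of such an empirical estimator follows; all class and method names here are illustrative, not from the paper.

```python
from collections import Counter, defaultdict

# Empirical next-state distribution \hat{P}(s' | s, a): relative
# frequency counts over observed transitions, as described above.
class EmpiricalTransitions:
    def __init__(self):
        # (state, action) -> Counter of observed next states
        self.counts = defaultdict(Counter)

    def observe(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def p_hat(self, s, a, s_next):
        n = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / n if n else 0.0

est = EmpiricalTransitions()
for s_next in ["x", "x", "y"]:
    est.observe("s0", "a0", s_next)
print(round(est.p_hat("s0", "a0", "x"), 3))  # 0.667
```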
(c) The stochastic check
is roughly the difference between the *cumulative* "averaged" optimistic
value and the *cumulative* "1step" optimistic value. In principle, after
n steps this can be as large as T*n (i.e., linear in n), and certainly
larger than V_max. The stochastic check, however, tests this against
T*\sqrt{n}; this would be satisfied by a pure-stochastic transition with
high probability, but not necessarily by an adversarial one, regardless of
how big n is. We hope this clears any confusion.
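The scaling argument can be sketched as follows; the constant c and the exact form of the paper's test differ, so this is only an illustration of why a sqrt(n) threshold catches a persistent linear-in-n gap.

```python
import math

# Hedged sketch of the stochasticity-check idea: flag (s, a) as
# adversarial when the gap between the cumulative "averaged" optimistic
# value and the cumulative "1-step" optimistic value exceeds roughly
# T * sqrt(n). The names and constant c are assumptions for illustration.
def looks_adversarial(cum_avg_value, cum_one_step_value, T, n, c=1.0):
    gap = cum_avg_value - cum_one_step_value
    return gap > c * T * math.sqrt(n)  # threshold grows like sqrt(n), not n

# A persistent per-step gap accumulates linearly in n, so it eventually
# exceeds the sqrt(n) threshold no matter how late the adversary starts;
# a stochastic transition keeps the gap at O(T * sqrt(n)) w.h.p.
print(looks_adversarial(5000.0, 1000.0, T=10, n=10000))  # True (gap 4000 > 1000)
print(looks_adversarial(1050.0, 1000.0, T=10, n=10000))  # False (gap 50 <= 1000)
```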
4. Finally,
to Assigned_Reviewer_9, we note that 2(c) and 3(a) above have addressed
some of your comments. And thanks for pointing out the glitch in line 169.
 