Paper ID: 413
Title: Reinforcement Learning in Robust Markov Decision Processes

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
The paper proposes and analyzes a method for learning in robust MDPs. While this setting is very similar to learning in stochastic games, the main difference is that in robust games, the optimal move of the opponent is observed, while in robust MDPs the decision maker only observes the outcome (the opponent chooses the probabilities).

The paper makes a small advance on a relevant, non-trivial, and interesting topic, but I am not sure that it is quite ready for publication in its current form.

First, the setting is somewhat contrived and not motivated. A natural setting would be simply to use reinforcement learning to learn to act in a robust setting. Instead, the authors assume that the robust uncertainty sets are known in advance, while the stochastic probabilities are not known. This choice is not justified by neither applications nor by the difficulty of the natural case. The analysis is not very novel either, with Lemma 2 being the most significant difference with prior work. Addressing the natural setting would make this a much stronger paper.

There are several minor issues too.

The behavior of the adversary is not well-defined during the execution of the algorithm. As defined on page 4, the adversary should take the minimum-return transition probabilities for the current policy. Therefore, the transition probabilities for the adversarial states should only change after a policy change. The algorithm does not take advantage of the fact and seems to assume that the adversary can simply behave arbitrarily – or perhaps behaves in order to maximize the regret. It would be good to clarify this issue.

I find the distinction between adversarial and stochastic states artificial and unnecessary. Stochastic states are simply a special case of adversarial states with a single distribution in the uncertainty set. In addition, adversarial states may behave as stochastic if the worst-case distribution chosen by the opponent is always the same; it is impossible to tell the two apart in that case.

I did not find the experiment useful – it is not reproducible, not well motivated, and does not really show anything interesting. It may be better to remove it and replace it by some of the proofs that are currently in the appendix.

The proofs are correct – if a bit tedious – as far as I can tell. The presentation could be improved in emphasizing the main difference in the proofs in comparison with the standard stochastic MDP results.
Q2: Please summarize your review in 1-2 sentences
The paper makes a small advance on a relevant, non-trivial, and interesting topic, but I am not sure that it is quite ready for publication in its current form. The setting is contrived, the methods used are quite standard and very similar to ordinary analysis for stochastic MDPs.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see
Additional comments after author feedback:

I thank the authors for answering my questions. I now understand how the proposed algorithm works. I do not have any concerns anymore regarding the stochastic check. I see how it may be eventually triggered.

Original review:

The paper describes a new reinforcement learning technique that can handle adversarial scenarios. This is a very important problem that has not been investigated very much. The proposed algorithm is to my knowledge the first that can learn whether each parameter is adversarial or stochastic and infer a robust policy based on that.

The paper is generally well written however, some important aspects of the proposed algorithm are not explained. I also have some concerns about the stochastic check.

In Figure 3, why is Q the minimum of T-t and a usual backup? If the rewards are always greater than 1, then Q will always be T-t and the algorithm won't be learning anything... I suspect that there is a simple explanation, but I'm not sure...

Also, in Figure 3, how is \hat{P} estimated? I couldn't find any description of how the transition distributions are estimated. Is it simply the relative frequency counts based on the trajectory observed?

I'm also having trouble convincing myself that the stochastic check is reasonable. It says that if the difference between the expected future rewards at (s,a) when computed in two different ways is greater than some threshold, then Pr(s'|s,a) must be adversarial. The threshold is an expression that depends on n and tau, which means that it grows over time. In fact, at some point it will be greater than the difference between Vmax and Vmin where Vmax = T*Rmax and Vmin = T*Rmin. This means that if the agent has not detected an adversarial behaviour by this number of steps, it will never find which parameters are really adversarial. This doesn't seem right. What will happen if the adversary behaves stochastically long enough to ensure that it is not detected and then reveals its adversarial nature after that. Would it manage to inflict arbitrary regret?

The empirical evaluation consists of a proof of concept since it includes only a single toy problem that was handcrafted to demonstrate that UCRL2 will fail to converge to a good policy. I encourage the authors to include a more thorough empirical evaluation with several problems, a variation in the degree of adversarial behaviours and to report the running times.
Q2: Please summarize your review in 1-2 sentences
The paper tackles an important problem: how to learn which parameters are adversarial vs stochastic in RL. The work is well motivated and interesting. Although the empirical evaluation consists of a proof of concept, the paper makes an important theoretical contribution. This is also the first work (to my knowledge) that learns to distinguish adversarial vs stochastic parameters in RL.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see

In this paper, the authors consider a reinforcement learning (RL) problem where transitions from some state-action pairs can be governed by an adversary. They propose provably efficient RL algorithms for this context. The proposed RL algorithms are variants of the UCRL2 algorithm of Jaksch et al. 2010, with an additional "stochasticity check" to detect adversarial state-action pairs. The authors also prove that the proposed algorithms achieve sublinear cumulative regret relative to a reasonable minimax policy. I believe the results are novel and significant for two reasons. The problem addressed is distinct from those addressed in the literature, as far as I know. The paper is well-written and enjoyable to read, and the analysis seems to be rigorous and technically correct. Perhaps one weakness is that the proposed algorithms require knowledge of $U(s,a)$ (the action space of the adversary). It may be unrealistic to assume that the algorithm knows $U(s,a)$'s precisely at the beginning. The authors should also provide more information about the experiment (maybe in the appendix). It is not clear to me whether or not the current experimental results are representative of what one should expect to see more broadly.

Some technical comments:

o I think the authors implicitly assume that $r(s,a) \in [0,1]$ (as in the UCRL2 paper, we need it to ensure that $T-t$ is an upper bound on Q-function in Figure 3), however, it seems that the authors have not explicitly mentioned it in this paper.
o In line 169, the definition of $V_t^{\pi}(s)$, the minimization and expectation should be over $P_t$ to $P_{T-2}$. Specifically, the states and actions from step $t$ to $T-1$ do not depend on $P_{T-1}$. $P_{T-1}$ influences the transition to $s_T$, which does not show up in the RHS of the equation.
Q2: Please summarize your review in 1-2 sentences
This paper considers learning in robust MDPs and proposes provably efficient algorithms for it. The results seem to be novel and significant.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for providing very helpful comments. We address the comments to each reviewer below.

1. Assigned_Reviewer_6 suggests a "natural setting" where one learns to act in a robust setting without knowing even the uncertainty sets. We note that this is one of our ultimate goals in learning robust MDPs. We believe that with the current results, we contribute a significant first step towards this goal.

2. The followings are to clarify some other issues raised by Assigned_Reviewer_6:

(a) "The adversary only chooses the worst-case transition".
- While the "value" of a state is defined based on the worst-case choice by the adversary, the adversary is free to choose other transitions (it can even be history-dependent). In fact, the algorithm takes advantage of adversaries that do not always choose the worst-case transition.

(b) "The distinction between stochastic and adversarial states is artificial".
- While a pure-stochastic state is a special case of general adversarial ones, the shift to more general adversarial behavior involves significant qualitative change. For example, the Markov property may no longer hold. We show an example of this in Figure 1. Indeed a similar setup that aims to handle an unknown system that can be either adversarial or stochastic has been studied in the bandit setting in Bubeck et al 2012.

(c) "Experiment not reproducible, not well-motivated".
- The example we show is a scenario in robust MDPs where a system deviates from a typical (well-modeled) behavior (e.g., an un-modeled system failure). The results can be easily reproduced over a wide range of parameters and we only use standard versions of the algorithms. We will include extra details in the appendix to enable straightforward reproduction of the exact quantitative results. We would like to remark that the submission is mostly a theoretic one and the simulation is mainly for a proof of concept.

3. The followings answer some questions by Assigned_Reviewer_8:

(a) The range of Q
- Indeed we assume that the reward of each step is bounded, in [0,1]. Therefore the value of Q_t is at most (T-t) since there are only (T-t) steps left. We will make this explicit.

(b) \hat{P} is defined in lines 234-237, and yes it is the empirical next-state distribution based on past transitions.

(c) The stochastic check is roughly the difference between the *cumulative* "averaged" optimistic value and the *cumulative* "1-step" optimistic value. In principle, after n steps this can be as large as T*n (i.e., linear in n), and certainly larger than V_max. The stochastic check, however, tests this against T*\sqrt{n} -- this would be satisfied by a pure-stochastic transition with high probability, but not necessarily by an adversarial one, regardless of how big n is. We hope this clears any confusion.

4. Finally, to Assigned_Reviewer_9, we note that 2(c) and 3(a) above have addressed some of your comments. And thanks for pointing out the glitch in line 169.