NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4976 Budgeted Reinforcement Learning in Continuous State Space

### Reviewer 1

The paper formulates a budgeted Markov decision process (BMDP) able to deal with large search spaces. The problem is framed as a multi-objective MDP considering the rewards and cost functions. This creates value functions for rewards and costs, and the framework searches for a minimum cost policy, within the maximum reward policies that satisfies the budget restrictions. The paper experimentally shows the performance of the proposed approach in three domains: corridor, spoken dialogue, and autonomous driving. The framework searches for a minimum cost policy given a maximum reward policy satisfying budget restrictions. In principle it could be framed the other way around searching for a maximum reward policy given minimum cost. Add a note of why this is not a good idea. When creating batch samples what happens when the budget is zero? The episodes ends with a negative reward? The proposed framework is not a contraction so in principle is not guaranteed to converge. The authors provide some possible explanation of why this was not an issue in the selected domains. In which cases should we not use the proposed approach? The convex hull policy algorithm is not very clearly described. How can the budget lie between two Q_c, is it not the case that all of them should respect the budget constraint? Typos: - provided Appendix A

### Reviewer 2

SUMMARY: The authors considered budgeted Markov decision processes (BMDPs), which contain a notion of risk and a threshold to control that risk. Here, the authors seek to extend methods for BMDPs such that they can be applied to problems with continuous state spaces and unknown dynamics. In particular, they propose new extensions of deep reinforcement learning (DRL) methods such that they can be applied to BMDPs. The authors validate their approach on two simulated applications. STRENGTHS: * The paper is well motivated in that the authors seek to extend BMDPs to for continuous-state, infinite-horizon tasks, which could be an important step in trying to address some existing problems with "modern" RL approaches. * The authors make BMDPs in continuous-state, infinite-horizon tasks possible by providing a new algorithm, BFTQ, that enables learning in such cases. * The parts of the algorithm I understood seemed intuitive, and I found the figures and supplemental visualizations of experiments intuitive and compelling as well. WEAKNESSES: (A) To my read, the discussion of the algorithm and implementation are not quite complete in the sense that I still have several important questions that I did not feel were answered: (1) I am confused as to why/how the \beta evolves during the learning process and how this effects the goals one has when using a BMDP. Do we not care about how much risk is induced (or negative effects experienced) during training? Moreover, if the policy selects a new \beta at each new time step, how do we enforce a *specific* \beta when we wish to *use* the policy? (2) The convex hull procedure seems critical to being able to actually use the algorithm, but the explanation is lacking any intuitive interpretation. For example, it's not clear to me from 4.1 and Algorithm 3 exactly how "Budget \beta [is] always respected" (comment on Algorithm 3, line 9). Can the authors provide more explanation/intuition about what is going on here? (3) The authors state that "The pseudo-code of our exploration procedure is shown in Algorithm 4 in Appendix B." Since this component of your algorithm is one of the main hypotheses that is being validated, it should appear in the main paper. Suggest swapping for Algorithms 1/2 since these are basic extensions of existing techniques. (B) There is a mismatch between the introduction of the paper, which speaks generally to using BMDPs, and the actual experiments run, which show the benefit of using BMDPs *compared to* computing a set of solutions using existing techniques like FTQ(\lambda). The introduction needs to mention that approaches like the latter *are* available solutions and frame the contribution of the paper rather as one of providing a "better" solution in whichever way the authors feel this is best described (more-efficient, etc.). MINOR COMMENTS: * It seems that else at the beginning of Algorithm 3, line 9 doesn't belong there. * Several times in the paper, it is mentioned that experiments are done in "two environments," but aren't there three? * The definition in (2) is odd given that you say the "budget evolves as part of the dynamics." That is, if \beta'=\beta_a (as suggested by the Dirac function, then it is merely whatever the action says it was, correct? Why is that part of "the dynamics?"

### Reviewer 3

Although building heavily on previous work (Boutilier and Lu, 2016), this paper makes a novel effort in proposing the BMDP framework as a concrete solution to safe RL. This includes theoretical results (both positive and negative) and practical algorithms. The paper is well written, but could be organized better (e.g., page 3 is just a sequence of formal statements without any cohesive text). Clarity could be improved: since the notation is particularly heavy, important quantities should be called by their name more often. My main issue is with the presentation of the results on the optimality operator. The fact that T is, in general, not a contraction, should be discussed in more depth in the paper, since it determines the legitimacy of the later proposed algorithms. In fact, it is shown in the appendix that T is a contraction when restricted to "smooth" value functions. Unfortunately, this is presented in the paper as a mere conjecture, for reasons I do not understand. Other issues: 1. The exploration strategy from Section 3.2 is merely described, without enough intuition on why it is needed and how it is expected to solve the problem 2. The implementation proposed in Section 4 seems to rely heavily on results from Boutilier and Lu, 2016, but a lot of the necessary context is missing. Moreover, a high-level description of Algorithm 3 would be useful. --- After discussion: given the new theorem, I raised the score. I expect the authors to improve the presentation of the proposed algorithm.