NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:7857
Title:Reinforcement Learning with Convex Constraints

Reviewer 1

The paper presents a way to solve the approachibility problem in RL by reduction to a standard RL problem. It casts this problem as a zero-sum game using conic duality, which is solved by a primal-dual technique based on tools from online learning. The proposed algorithm assumes an oracle that approximately solves a standard RL problem. It runs primal-dual iterations, where the dual part of the algorithm updates measurement weights according to the current primal solution obtained from the oracle. Originality: This work introduces a new problem of finding policy those measurements vectors lies inside a convex target set. The presented approach largely builds on the ideas from Abernethy et al. The high-level algorithm (Algorithm 1) and the performance bound (Theorem 3.1) are known in the literature and are similar to those of Le et al. (Algorithm 1, Proposition 3.1) with the conic duality being used instead of Lagrangian duality. The proposed algorithm is novel to the RL community. The theoretical performance guarantee is established (Theorem 3.3) that relates the approximation errors at its subroutines and the number of rounds to the performance gap. Quality: The paper is theoretically sound. It supports the proposed algorithm with performance guarantee and the proofs are clearly presented. However, I have doubts regarding the practical feasibility of the algorithm that involves approximately solving an RL problem at each iteration. Even though authors describe computational shortcuts in Section 3.4, the effective running time and required memory resources is not reported in the experiments. Related concern is about the comparison at a fixed the number of trajectories (Figure 1). If my understanding is correct, the proposed algorithm does much more computation at a fixed number of trajectories than RCPO that only does 1 gradient descent step at each transition. Commenting on experiments, line 284 states that the algorithm achieves a “larger reward faster by the baseline”, but it is unclear if it is due to higher complexity of ApproPO. Moreover, it was not the part of the objective. Clarity: The paper is well-written, but it might not be straightforward for a reader from RL community. The paper would benefit from better the link/integration with the RL literature; explanation of the policy space convexification as it is not standard; explicitly stating time/space complexity of the algorithm; explicitly stating the two problems studied: the feasibility problem and distance minimization in Section 2 and link them to relevant sections. Significance: This paper presents a novel approachibility problem that might be of interest to the RL practitioners. The paper introduces to the RL community an approach from Abernethy et al. that combines ideas of conic duality and online learning. Although, the practical feasibility of this algorithm remains a concern, it might encourage other researchers to pursue this direction.

Reviewer 2

As far as I am aware, the contribution of this paper is novel. The authors propose a novel method for dealing with constraints in RL by solving a minimax game. The paper is very well written with great attention to detail, and I believe this could be of interest to the RL community given the increasing interest in safety. There are a few details that were not completely clear, so I would appreciate the effort in trying to address my concerns: - line 79: what is Delta(pi)? I don't think this was properly introduced. - line 90: if z is a vector and C a set, how do you define a distance between them? Is is the distance between z and the projected point to C? If so, what is the impact on the type of projection being used (e.g., orthogonal, oblique)? - Are lambda and zeta vectors? Is the reward represented as the dot product of these vectors? If so, the reward is a still a scalar, so how is this different from specifying the constraints as an added term in the reward function of standard RL? This is not completely clear to me. - I would appreciate some more general comment on the formulation. How does this game-theoretic framework compare to other techniques like projected methods? For ex. PNAC [1]. Why should one choose a game-theoretic approach over projected methods? Granted, I am not very familiar with the game-theory literature, but it is not clear to me what the benefit would be. Could you elaborate on that? - Figure 1: why are results a function of trajectories? Is this the same as episodes? [1] Thomas P., Dabney W., Giguere S., Mahadevan S. Projected Natural Actor Critic (NIPS 2013)

Reviewer 3

After reading all the reviews and the rebuttal, I still think that it is a good paper. The authors have also addressed my concern about the running time of the algorithm. The paper considers reinforcement learning problems where the goal is to find a policy such that an expected measurement vector satisfies some fixed convex constraints. In the RL setting, this problem can be seen as an approachability problem. Therefore, after formulating it as a two-player game, the authors propose to solve it with a no-regret learning algorithm. Regrets bound are also derived from existing ones in online convex optimization. In the reproducibility checklist, the authors state that a link to a downloadable source code is provided, however I have not found any in the paper or the supplemental material. * Originality Recently several works have considered constrained RL problems. This paper tackles a more general framework than those previous works by accepting any convex constraints. The reformulation of the overall problem as a two-player game is interesting and relatively novel, although it is not surprising for someone familiar with online convex optimization (see Chap. 8 of "Introduction to Online Convex Optimization" by Hazan). I think the authors should compare their approach with the following work: @article{KalathilBorkarJain14, Author = {D. Kalathil and V.S. Borkar and R. Jain}, Journal = {Arxiv}, Title = {A Learning Scheme for {Blackwell's} Approachability in {MDPs} and {Stackelberg} Stochastic Games}, Year = {2014}} * Quality The theoretical results seem to be correct. The overall game-theoretic formulation and the proof techniques are based on results from online convex optimization (regret bounds) and from Abernethy et al. (distance as max of linear function, cone from convex set). It seems to me that some previous work (e.g., RCPO) on constrained reinforcement learning could have been easily extended to deal with convex constraints. It would have been nice to have some discussion about this. The experiments are a bit limited (only one small domain and one baseline). I think it would have been interesting to compare with other concurrent methods for constrained reinforcement learning (e.g., CPO or the work by Kalathil et al.). In Fig. 1, the plots for APPROPO end earlier than the other ones. I guess that the authors have not finished running them at the submission time. This is probably due to the fact that the proposed method is quite computationally heavy, I think some comments about the running times of APROPO and RCPO should be provided in Section 4. * Clarity The paper is well-written and clear. * Significance The advantages of the method is that it is a general scheme that can accept more general convex constraints and also it has some performance guarantees. The drawbacks of the approach is that the obtained solution is a mixed policy, which is not very convenient to apply in practice and also the computational costs of the method. In my opinion, these make APROPO more of a theoretical proposition, which may perhaps pave the way for more practical and efficient algorithms in the future. Some typos: - l.149: Z(\mu) is used for the first time here. Should it be \bar z(\mu)? - l.152: "This means ..." is only true if the problem is feasible? - l.155: must to implement - l.181: apprimately - l.260: we previous