NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Reviewer 1
1. Originality: The work puts together ideas and formulations from various areas. I believe this work is an interesting extension of the seminal work on CVaR-based RL by Chow and Ghavamzadeh, but I would not consider it particularly novel. It mainly makes use of the Chow and Ghavamzadeh machinery and proposes an interesting but relatively straightforward extension.
2. Quality: I believe the work is technically correct, even though I was not able to go through all the details. Prior works are properly cited and the authors provide proofs for their claims.
3. Clarity: The paper builds on a number of previous works, so it is not easily accessible to an audience without the necessary background. But I think it is clearly written and knowledgeable readers should be able to follow it.
4. Significance: Learning options is quite important in the RL context, and robust options have been studied recently [Mankowitz et al.]. Even though I do not feel that the present work advances the theory of robust options in a major way, it nevertheless provides an interesting framework for robust options that additionally incorporates the CVaR concept. Some practitioners may find such ideas useful.
UPDATE: I have read the rebuttal and the other reviews, and I appreciate the authors' response to the points I raised. My overall feeling is that this is an interesting work in the field of reinforcement learning with robust options, and the experiments (old and new) look promising. That said, all reviewers agreed that the novelty over Chow and Ghavamzadeh is not very significant. For this reason, I will keep my weak-accept score, acknowledging that this is an above-average submission whose novelty is not particularly pronounced (at least in the present form of the submission).
Reviewer 2
The paper applies a CVaR-based policy gradient method to robust option learning. The authors interpret the CVaR-constrained objective as an unconstrained objective of a new MDP. It is interesting to see this connection and how it simplifies the understanding of the new policy gradient method. Experiments are conducted to address the three questions raised in Section 4. However, correct me if I misunderstood, the difference between OC3 and Algorithm 1 in Chow and Ghavamzadeh [6] is just that OC3 updates a set of parameters $\{\theta_{\pi_\omega}, \theta_{\beta_\omega}\}_{\omega \in \Omega}$ together with $\theta_{\pi_\Omega}$, whereas Chow and Ghavamzadeh update a single $\theta$ (see my sketch of the objective at the end of this review). Other than that, the two algorithms are almost the same. Meanwhile, the MDP defined in Section 3.2 is also quite similar to the augmented MDP mentioned in Section 5.1 of Chow and Ghavamzadeh [6]. Although applying existing robust MDP methods to robust option learning is important, it is unclear to me what the originality of the proposed algorithm is.
Re: Author Response
Thanks for clarifying my concerns. I have adjusted my score accordingly.
Typo:
- L149: the expected loss R'' can be written as...
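For completeness, here is my own sketch (notation mine, not necessarily the paper's) of the CVaR-constrained objective and the representation that makes the augmented-MDP interpretation possible, writing $D^\theta(x^0)$ for the cumulative cost from the initial state, $\beta$ for the risk tolerance, and $\alpha$ for the confidence level:
\[
\min_\theta \; \mathbb{E}\big[D^\theta(x^0)\big]
\quad \text{subject to} \quad
\mathrm{CVaR}_\alpha\big(D^\theta(x^0)\big) \le \beta,
\qquad
\mathrm{CVaR}_\alpha(Z) = \min_{\nu \in \mathbb{R}} \Big\{ \nu + \tfrac{1}{1-\alpha}\,\mathbb{E}\big[(Z - \nu)^{+}\big] \Big\}.
\]
Because the inner minimization introduces the auxiliary variable $\nu$, the Lagrangian of the constrained problem can be treated as an unconstrained objective over an augmented state that tracks the remaining risk budget, which is exactly why Section 3.2 resembles Section 5.1 of [6].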
Reviewer 3
It seems that the assumption of P being structured as a Cartesian product is crucial for the derivation. However, the paper does not discuss why this assumption is valid. It seems equally possible that a single distribution is chosen at the beginning and then carried through for every state. It would also be helpful to explain why the augmented states and rewards are necessary.
Eq. 5: y should be \nu
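To be precise about my first point: by "Cartesian product" I mean the usual rectangularity assumption on the uncertainty set (notation mine), as sketched below.
\[
\mathcal{P} \;=\; \bigotimes_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}_{s,a},
\qquad \mathcal{P}_{s,a} \subseteq \Delta(\mathcal{S}),
\]
i.e., the adversary may choose the transition distribution independently for each state-action pair. If nature instead draws a single model parameter once and carries it through every state, this independence no longer holds, which is why I would like the assumption justified.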
Reviewer 4
This paper provides an interesting robust option learning formulation based on cost minimization under a CVaR constraint, and proposes an option-critic algorithm for the resulting CVaR MDP problem. The authors also provide derivations of robust option learning algorithms (based on the soft-robust loss function in Equation 1) in the appendix. While I think these robust option-critic algorithms and derivations can potentially be solid contributions to the hierarchical RL (HRL) community, I have the following questions/concerns:
1) While I understand the paper is proposing the CVaR option learning algorithm, how are the derivations of the option-critic w.r.t. the soft-robust loss related to the CVaR option-critic algorithm? Without looking further into the literature, it seems that the soft-robust option-critic algorithm is also something that the authors are trying to propose/analyze in this paper. Are there any relationships between this analysis and the CVaR-augmented MDP? (See my notational sketch at the end of this review.)
2) Without further understanding of the analysis (from the question above), the contribution of this work seems quite incremental. When the CVaR MDP is expressed as an augmented MDP, why can't one just apply Bacon's option-critic algorithm for risk-neutral MDPs to solve this problem? The CVaR MDP augmentation technique has also been proposed by Chow 2014 (which refers to earlier papers, such as https://link.springer.com/article/10.1007%2Fs00186-011-0367-0, for similar techniques used in the special case of finite-horizon problems). So it seems the novelty here is only to extend this CVaR MDP framework to include options.
3) While the experimental results are positive in showing that the CVaR option-critic outperforms most other baselines, most benchmarks are simple MuJoCo control tasks. Can the authors provide some motivation for why option learning is needed here? I understand that for the hopper ice-block problem, using the option framework probably makes sense, and I appreciate the in-depth study that is included in the Appendix. But for other tasks such as HalfCheetah and Walker, why can't we just solve the problem with a standard CVaR PG, such as the ones from Tamar 15 and Chow 14? If so, how does the performance of the vanilla CVaR PG algorithm compare with its option-critic counterpart?
That being said, in general this work potentially has some good contributions to robust option learning. The following score merely reflects my above concerns/questions. I would consider adjusting it based on the authors' responses.
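For reference, by "soft-robust loss" I mean (in my own notation, which may differ from the paper's Equation 1) an average of the expected cost over the uncertainty set under some weighting distribution, as opposed to the CVaR-constrained objective:
\[
L_{\text{soft}}(\theta) \;=\; \mathbb{E}_{p \sim \mu}\Big[\, \mathbb{E}_{\tau \sim p,\, \pi_\theta}\big[ D(\tau) \big] \,\Big]
\qquad \text{vs.} \qquad
\min_\theta \; \mathbb{E}\big[D\big] \;\; \text{s.t.} \;\; \mathrm{CVaR}_\alpha(D) \le \beta,
\]
where $\mu$ is a prior over models in the uncertainty set and $D(\tau)$ is the cumulative cost of trajectory $\tau$. My question 1) is essentially how the policy-gradient derivations for the left-hand objective relate to the option-critic algorithm derived for the right-hand one.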