NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5017
Title: Language as an Abstraction for Hierarchical Deep Reinforcement Learning

Reviewer 1

I believe the proposed method, HAL (Hierarchical Abstraction with Language), is an interesting approach to HRL. The authors adapt Hindsight Experience Replay to instructions (called Hindsight Instruction Relabelling). I have some concerns about the experimental setup and empirical evaluation of the proposed method:

- The motivation behind introducing a new environment is unclear. There are many similar existing environments, such as the crafting environment used by [1] and the compositional and relational navigation environment in [2]. Introducing a new environment (unless it's necessary) hinders proper comparison and benchmarking. It seems to me that the environment was specifically designed to highlight the strengths of the proposed method. One of the most important motivations behind studying HRL methods is solving complex sparse-reward tasks. I would have liked to see the proposed method applied to some of the most popular sparse-reward tasks, such as Montezuma's Revenge, as this would help gauge the significance of the proposed method compared to several published methods evaluated on these tasks. If the proposed method cannot be applied to standard HRL tasks, then that is a limitation of the method which should be discussed.
- I believe the proposed method is similar to [1] and [3]. The authors should position their work with respect to [1] and [3], which would also serve as better baselines in my opinion.
- Two HRL methods used as baselines in the experiments completely fail in the new environment proposed by the authors. Some explanation of this result would be helpful.
- The authors state that the high-level policy always (even in the diverse setting) uses the ground-truth state as input, namely position and one-hot encoded colors and shapes. I think the ground-truth state consists of compositional features which make it very easy for the high-level policy to learn to output instructions, while the baselines cannot leverage this. I believe this is an unfair comparison, and it also raises concerns about the effectiveness of the high-level policy with a high-dimensional state space, especially because it is not tested on any standard environment.
- The appendix is full of typos. For example, line 675, "Due tot", and line 705, "do to time constraint". Section B.1, which is referenced several times in the main paper, seems to be incomplete.

[1] Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular multitask reinforcement learning with policy sketches. ICML-17
[2] Yu, H., Zhang, H., & Xu, W. (2018). Interactive grounded language acquisition and generalization in a 2d world. ICLR-18
[3] Oh, Junhyuk, et al. "Zero-shot task generalization with multi-task deep reinforcement learning." ICML-2017.

----
Updated after author response: After reading the author response and other reviews, I maintain my rating. This is due to the following reasons:

1) It seems to me that HAL is specifically designed for the proposed environment and is not a general HRL method. The author response confirms that the proposed method is not general enough to be applied to any environment ("environments like Montezuma’s don’t have labeled data or infrastructures for complex language captioning"). However, the introduction claims that HAL is a general HRL method; the first contribution stated is "a framework for using language abstractions in HRL, with which we find that the structure and flexibility of language enables agents to solve challenging long-horizon control problems". Specifically, the environment needs to report whether each language statement is satisfied by the current world state. I believe the crafting environment implemented by the authors also provides this information. This is a very strong assumption and severely limits the applicability of an HRL method. These limitations are not acknowledged in the submission. I suggest reframing the introduction to introduce the task/environment and the challenges associated with it, and to propose a solution for the specific task.
2) Based on the author response, I believe most of the gains over the baselines come from Hindsight Instruction Relabelling (the authors also mention "DDQN is able to solve only 2 of the 3 tasks, likely due to the sparse reward" in Section 6.2, and in the rebuttal they say "HAL significantly outperforms policy sketch because it is off-policy and leverages hindsight relabeling"). In my opinion, HIR is an adaptation of HER to the proposed environment and is not very original.
3) The above also raises concerns about fairness in the comparison with other methods. HIR requires specific information from the environment about whether each language statement is satisfied by the current world state. This makes the comparison unfair because the baselines do not leverage this information from the environment.
4) I also agree with Reviewer 3's concern that the high-level policy only chooses from a fixed instruction subset and therefore does not learn or output anything compositional. The additional results provided in the author response are significantly different from the original submission and require additional details.
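To make the assumption discussed above concrete, here is a minimal sketch of hindsight relabelling with instructions. It is not the authors' implementation: the `Transition` fields, the `is_satisfied` oracle, the toy state keys, and the reward value of 1.0 are all hypothetical choices made for illustration. The point it shows is that relabelling needs the environment to answer, for every candidate instruction, whether the resulting state satisfies it.

```python
from typing import Callable, Dict, List, NamedTuple

State = Dict[str, int]  # hypothetical flat state: object name -> x position


class Transition(NamedTuple):
    state: State
    action: int
    next_state: State
    instruction: str
    reward: float


def relabel_with_hindsight(
    transition: Transition,
    instructions: List[str],
    is_satisfied: Callable[[str, State], bool],
) -> List[Transition]:
    """Return the original transition plus one relabelled copy for every
    other instruction that the next state happens to satisfy."""
    relabelled = [transition]
    for instruction in instructions:
        if instruction == transition.instruction:
            continue
        # This query is the assumption under discussion: the environment
        # must be able to evaluate each instruction against a state.
        if is_satisfied(instruction, transition.next_state):
            relabelled.append(
                transition._replace(instruction=instruction, reward=1.0)
            )
    return relabelled


def toy_oracle(instruction: str, state: State) -> bool:
    """Hypothetical satisfaction checker for two hand-written instructions."""
    if instruction == "red ball is left of blue cube":
        return state["red_ball_x"] < state["blue_cube_x"]
    if instruction == "red ball is right of blue cube":
        return state["red_ball_x"] > state["blue_cube_x"]
    return False


INSTRUCTIONS = [
    "red ball is left of blue cube",
    "red ball is right of blue cube",
]

# The agent was asked to move the ball to the right but moved it to the left.
failed = Transition(
    state={"red_ball_x": 3, "blue_cube_x": 1},
    action=0,
    next_state={"red_ball_x": 0, "blue_cube_x": 1},
    instruction="red ball is right of blue cube",
    reward=0.0,
)
out = relabel_with_hindsight(failed, INSTRUCTIONS, toy_oracle)
```

In this toy case the failed transition is kept and a second copy is added under the instruction the agent accidentally achieved. A baseline that never calls an oracle like `toy_oracle` does not receive this extra supervision, which is the fairness concern raised in point 3.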

Reviewer 2

Originality and Significance: The idea is natural and intuitive. Now that the authors have shown this idea works, there's a direct avenue for incorporating ideas from other work (e.g., generalization in visual QA) to improve RL. The authors did a great job finding the right setting (a reasonably compositional one) to showcase language's promise in RL (highlighted by the systematic generalization results). I know that several others have been thinking about this idea in general for a while (using language as an abstraction in HRL); for example, see the concurrent/later work "Hierarchical Decision Making by Generating and Following Natural Language Instructions". Regardless, it is great to see this idea actually work in a pretty challenging, sparse-reward RL setting.

One drawback of the implemented agent is that the high-level policy treats each instruction distinctly, which takes away from some of the story of aiding RL by exploiting the compositionality of language. Decoding instructions in a compositional manner would fit better with the authors' aim. Full autoregressive decoding would be impressive (but challenging), but it would be interesting even just to factor the action space compositionally (e.g., first predict the instruction template, then predict the key nouns/adjectives in the template, or perhaps use a hierarchical softmax to decode actions). Right now, as I understand it, the low-level policy treats the goal input as compositional, but the high-level policy does not treat actions as compositional.

Quality: The work is well-executed. Task-design and model-design decisions are simple, clear, and well-motivated. The instructions themselves could be more diverse; I would have been more interested in seeing the authors experiment with highly compositional/diverse instructions (i.e., at the level of language complexity/compositionality of CLEVR questions) on the state-based environment rather than with simpler language instructions from pixel-based observations (since the paper's focus is on language).

Clarity: The writing was quite clear overall. In general, I felt that the paper made many distinct points about how language could be useful; it would have been helpful to frame the intro/discussion/paper as focusing on 1-2 of these (e.g., compositional generalization), and to be concrete about how language can help. For example, it seems that compositional generalization through language is a relatively unique/strong aspect of this work, while using language instructions to specify vague goals is a property of instruction-following tasks in general, not specific to using language in HRL. I only understood around Page 7 (experiments) why the authors concretely expected language to help with generalization (when the authors describe the explicitly non-compositional approaches); even then, I would have liked more explanation of why the policies did generalize compositionally, in contrast with the expectation (Page 7): "From a pure statistical learning theoretical perspective, the agent should not do better than chance on such a test set."

Minor writing comments:
* "Fortunately, while the the size" -> "Fortunately, while the size"
* "arrnage 4 objects around an central object" -> "arrange 4 objects around a central object"
* Figure 4b: Maybe order the keys in the figure by the number of instructions / the performance in the graph (more intuitive/easier to read). And/or spell out "12k" -> "12000" so it's faster to tell what's going on (initially I was confused reading the legend)
* Figure 5: The legend is pretty small
* The appendix has a few typos as well
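The suggestion above to factor the high-level action space (predict a template, then fill its slots) can be sketched as follows. This is only an illustration under stated assumptions: the templates, color and shape vocabularies, and logit values are invented, and the three heads are treated as independent for simplicity, whereas a real model would likely condition the slot heads on the chosen template.

```python
import math
from itertools import product
from typing import Dict, List, Sequence

# Hypothetical vocabularies, not taken from the paper.
TEMPLATES = ["move the {color} {shape} left", "move the {color} {shape} right"]
COLORS = ["red", "blue", "green"]
SHAPES = ["ball", "cube"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def factored_distribution(
    template_logits: Sequence[float],
    color_logits: Sequence[float],
    shape_logits: Sequence[float],
) -> Dict[str, float]:
    """Joint distribution over full instructions from three small heads.

    The probability of an instruction is the product of the probabilities
    of its template, color, and shape (independence is an assumption here).
    """
    p_template = softmax(template_logits)
    p_color = softmax(color_logits)
    p_shape = softmax(shape_logits)
    dist: Dict[str, float] = {}
    for (t, pt), (c, pc), (s, ps) in product(
        zip(TEMPLATES, p_template), zip(COLORS, p_color), zip(SHAPES, p_shape)
    ):
        dist[t.format(color=c, shape=s)] = pt * pc * ps
    return dist


# 2 + 3 + 2 = 7 logits cover 2 * 3 * 2 = 12 distinct instructions.
dist = factored_distribution(
    template_logits=[0.0, 1.0],
    color_logits=[2.0, 0.0, 0.0],
    shape_logits=[0.5, 0.0],
)
best = max(dist, key=dist.get)
```

The design point is that the number of outputs grows with the sum of the slot sizes rather than their product, and that instructions sharing a color or shape share parameters, which a flat softmax over distinct instructions does not provide.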

Reviewer 3

Thanks to the authors for the response. These new experiments are certainly a step towards better demonstrating the role of compositionality in this work. However, these experiments need elaboration and further analysis, especially since the formulation of the high-level policy is a new one. This changes the paper quite a bit and I feel it would necessitate another round of evaluation.

---------

This paper proposes to use instructions in natural language as a way of specifying subgoals in a hierarchical RL framework. The authors first train a low-level policy to follow instructions (using dense rewards) and a high-level policy to generate/choose instructions (using sparse environmental rewards). The goal is to enable better generalization of HRL algorithms and to solve the sparse reward problem. I really like the idea of this paper: it is definitely novel and worth pursuing, and the paper is written clearly. However, the execution is a bit lacking and the experiments do not clearly demonstrate the advantage of using compositional language, which is the main premise of the paper.

Re: compositionality: While the paper's idea of using the compositional nature of language to improve HRL is definitely interesting, the experiments do not back up the claims. First, the high-level policy only chooses from a fixed instruction subset and therefore does not learn or output anything compositional. Second, the non-compositional baseline for the low-level policy doesn't make sense. Why should a lossless representation be non-compositional, especially with a sequence auto-encoder? For a fair comparison with the language representation of subgoals, shouldn't the auto-encoder representation also be fed into a GRU rather than used directly? Further, another baseline would be to simply have a bag-of-words representation for each instruction (note that this is different from having a one-hot representation for each instruction). From the current experiments, it is unclear that there is an advantage to using language as an intermediate representation (even though theoretically it makes perfect sense!).

Other comments:
1. Some of the experimental details are not clear. Do you assume access to the ground truth state of the world for the HIR procedure? If so, this should be made clear as an assumption.
2. Is the number of actions used for the high-level policy the same for your method (80) as for the other HRL baselines? From C.4, it looks like the goal space is much larger for the baselines.
3. (minor) How well does the model generalize to real instructions (with potential noise)?
4. Relevant work:
a) Speaker-follower models for vision-and-language navigation. D Fried, R Hu, V Cirik, A Rohrbach…, Advances in Neural …, 2018
b) Grounding language for transfer in deep reinforcement learning. K Narasimhan, R Barzilay, T Jaakkola, Journal of Artificial Intelligence …, 2018
c) Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention. Khanh Nguyen, Debadeepta Dey, Chris Brockett, Bill Dolan
d) Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. Hao Tan, Licheng Yu, Mohit Bansal
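The distinction drawn above between a bag-of-words baseline and a one-hot representation per instruction can be illustrated with a small sketch. The instruction strings and helper names are hypothetical and only meant to show what each encoding can and cannot share across instructions.

```python
from typing import Dict, List

# Hypothetical instruction set, not taken from the paper.
INSTRUCTIONS = [
    "red ball left of blue cube",
    "red ball right of blue cube",
    "green cube left of red ball",
]


def build_vocab(instructions: List[str]) -> Dict[str, int]:
    """Map every distinct word to an index (sorted for determinism)."""
    words = sorted({w for ins in instructions for w in ins.split()})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(instruction: str, vocab: Dict[str, int]) -> List[int]:
    """Word-count vector: instructions that share words share features."""
    vec = [0] * len(vocab)
    for word in instruction.split():
        vec[vocab[word]] += 1
    return vec


def one_hot(instruction: str, instructions: List[str]) -> List[int]:
    """One slot per whole instruction: no structure is shared at all."""
    vec = [0] * len(instructions)
    vec[instructions.index(instruction)] = 1
    return vec


def overlap(u: List[int], v: List[int]) -> int:
    """Dot product, used here as a crude measure of shared features."""
    return sum(a * b for a, b in zip(u, v))


vocab = build_vocab(INSTRUCTIONS)
bow_overlap = overlap(
    bag_of_words(INSTRUCTIONS[0], vocab), bag_of_words(INSTRUCTIONS[1], vocab)
)
one_hot_overlap = overlap(
    one_hot(INSTRUCTIONS[0], INSTRUCTIONS), one_hot(INSTRUCTIONS[1], INSTRUCTIONS)
)

# Bag-of-words discards word order, so these two differ in meaning but not in encoding.
same_bag = bag_of_words("blue cube left of red ball", vocab) == bag_of_words(
    INSTRUCTIONS[0], vocab
)
```

The first two instructions share five words under bag-of-words but nothing under one-hot, so a bag-of-words baseline would test how much of the benefit comes from shared vocabulary alone. Because it ignores word order, it also separates that effect from anything a sequence model such as a GRU contributes.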