| Paper ID | 8807 |
|---|---|
| Title | Goal-conditioned Imitation Learning |

After rebuttal comments: the authors address many questions and propose some updates to improve the quality of the paper. In my view these are minor and doable. Assuming these adjustments are made, my overall score is increased.

________________________________________________

Originality: the paper builds on previous work and the ideas used in those works (HER, BC, DDPG, GAIL). It argues that learning in goal-oriented sparse-reward problems can be sped up using demonstrations. Data relabeling, previously exploited in HER, is now applied to BC. Additionally, goal-oriented demonstrations are combined with GAIL (goal-oriented GAIL) and DDPG to learn an appropriate policy. Ultimately, the policy gradient of BC can be combined with DDPG's, based on discriminator rewards from GAIL. While the paper considers different combinations of existing works, it mostly convincingly shows how demonstrations can speed up learning, especially in "bottlenecked" tasks. The paper does not discuss literature that purely considers demonstrations for goal-oriented robot learning (on real robots) [1].

Quality: the overall quality of the paper is OK, but some details remain vague. Using sparse rewards that depend on goal states limits the range of tasks to which this kind of approach is applicable. For example, for the FetchSlide task, or for a simple throwing task, the state-dependent sparse reward will not work in my opinion. The paper also claims that scaling goalGAIL to real-world robotic tasks is potentially possible. It is difficult to see how, given the data-hungry RL setting (at least 10^6 steps until convergence). While the paper convincingly shows different positive aspects of exploiting demonstrations, it is not clear how these demonstrations were collected for the simulated tasks.

Clarity: the paper overall reads well. However, the techniques used are not always clearly introduced or explained. Examples: what is the objective J in the policy gradient in Sec. 4.1 (also, use equation numbers); state clearly what the discriminator corresponds to when introducing GAIL, and what the main idea behind GAIL is. I'm afraid the reader requires substantial knowledge of the related literature to understand the technical approach, because the prior work it builds on is not introduced clearly. Minor suggestion: use goalGAIL or gGAIL in the figures to highlight your proposed approach.

Significance: the paper overall carries an interesting message, but the lack of clarity and some vague technical details make its impact a bit hard to assess. Nevertheless, in my opinion the paper offers enough value for the research community, and others may be motivated by the work and use it.

[1] Calinon: A Tutorial on Task-Parameterized Movement Learning and Retrieval, Intelligent Service Robotics 9 (1), 1-29, 2016
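
To make the sparse-reward concern raised under Quality concrete, here is a minimal sketch of a state-dependent sparse goal reward. The function name, the Euclidean distance test, and the tolerance `eps` are illustrative assumptions, not the paper's exact definition.

```python
import math

def sparse_goal_reward(achieved, goal, eps=0.05):
    """Hypothetical sparse reward: 1.0 only when the achieved state
    lies within `eps` (Euclidean distance) of the goal, else 0.0."""
    return 1.0 if math.dist(achieved, goal) < eps else 0.0

# Toy 2-D trajectory approaching the goal (values are made up).
trajectory = [(0.0, 0.0), (0.5, 0.0), (0.9, 0.0), (1.0, 0.0)]
goal = (1.0, 0.0)

# Reward is zero everywhere except at the state that reaches the goal.
rewards = [sparse_goal_reward(s, goal) for s in trajectory]
print(rewards)
```

Under this kind of reward the agent receives no signal until a state actually falls inside the tolerance region, which is why the choice of which state component is compared to the goal matters for tasks such as sliding or throwing.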

Post-response comments: I have read the response and it was informative. The new tasks are a good addition.

-----------

- What do you mean by quasi-static tasks in Section 4.2? It could be a number of different things and I’m not sure I captured which one it refers to.
- Using state-only trajectories instead of trajectories of state-action pairs certainly seems convenient for operating with demonstrations from different sources and for generalizing to different agents with different properties. At the same time, I wonder if there are negative effects of doing this, as the notion of the transition function is lost.
- In Figure 1, was the system starting from the same initial state for each test to each goal state?
- The writing was clear and easy to follow.
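
One possible negative effect of state-only inputs, offered only as a toy illustration of the question above: when different actions lead to the same next state, a discriminator that sees (s, s', g) cannot tell them apart, whereas one that sees (s, a, g) can. The 1-D corridor, the clipping dynamics, and all function names below are hypothetical and are not taken from the paper.

```python
def step(state, action, lo=0, hi=4):
    """Toy 1-D corridor: move by `action`, clipped at the walls."""
    return max(lo, min(hi, state + action))

def state_action_input(s, a, g):
    """Discriminator input built from the state-action pair."""
    return (s, a, g)

def state_only_input(s, a, g):
    """Discriminator input where the action is replaced by the next state."""
    return (s, step(s, a), g)

s, g = 3, 4
# Two different actions taken next to the right wall.
x_sa_1 = state_action_input(s, 1, g)
x_sa_2 = state_action_input(s, 2, g)
x_ss_1 = state_only_input(s, 1, g)
x_ss_2 = state_only_input(s, 2, g)

print(x_sa_1 != x_sa_2)  # state-action inputs differ
print(x_ss_1 == x_ss_2)  # state-only inputs coincide
```

Whether such ambiguity matters in practice depends on the task; in this toy case both actions are equally good for reaching the goal, so the loss of information is harmless.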

Even though the proposed expert relabeling technique can augment the training data to help alleviate the reward sparsity problem, the introduced goals, which are intermediate states, are different from the ground-truth goal, and this may introduce a large amount of noise, especially when the true rewards are limited. I don't know how to alleviate or avoid this problem or, equivalently, how to guarantee that the benefit of augmenting the data will be larger than the negative effect of the introduced noise.

Line 171: different than -> different from

------------------------------------------------------------------------------------------

Authors' response partially clarifies my concern.

# Originality
This work is primarily built on Hindsight Experience Replay (HER), Behavioural Cloning (BC), and Generative Adversarial Imitation Learning (GAIL). It combines these three works and additionally proposes a new expert relabeling technique. Besides, by replacing the action with the next state, the algorithm can be trained with state-only demonstrations. Although the novelty of this paper is somewhat incremental, it combines these components in a reasonable fashion.

# Quality
In general, the method proposed in this paper is technically and experimentally sound. On both sides, however, I still have a few questions. First, in Section 4.2, you claim that the key motivating insight behind the idea of relabeling the expert is that "if we have the transitions (s_t, a_t, s_{t+1}, g) in a demonstration, we can also consider the transition (s_t, a_t, s_{t+1}, g'=s_{t+k})". Do you have a rigorous mathematical proof of this statement, or of the conditions under which it holds? It is well known that a globally optimal policy is not necessarily locally optimal at each step or on each sub-trajectory. Second, on the experimental side, do you have any explanation for why GAIL+HER with/without ER is better than BC+HER with/without ER in continuous four rooms but worse in fetch pick&place?

# Clarity
The paper is generally well written. However, to be more reader-friendly, the authors should reorganise the layout to keep text and its corresponding figures/tables/algorithms on the same page as much as possible. For example, Algorithm 1 is presented on Page 1 but its first mention is on Page 5. In addition, the paper has some minor typos: e.g., is able -> is able to (line 57); U R -> R (line 5 of Algorithm 1); form -> from (line 109); etc.

# Significance
The problem explored in this paper is important and the authors propose a natural and reasonable solution to it. Building on this, I believe there are still some other directions worth exploring. For instance, as we see in Figure 4, BC and GAIL with ER and without ER perform quite differently in the two tasks. There must be some reason behind this phenomenon. Is it task-specific or method-specific? If method-specific, what causes the difference? Answers to these questions would be of great interest and of assistance to the community.
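
For readers unfamiliar with the relabeling idea quoted under Quality, the following is a minimal sketch of how a demonstration transition (s_t, a_t, s_{t+1}, g) could be relabeled with g' = s_{t+k}. Sampling k uniformly over the remaining steps is an assumption made here for illustration, and the function name is hypothetical; the paper may use a different sampling scheme.

```python
import random

def relabel_expert_trajectory(states, actions, rng=random):
    """Return transitions (s_t, a_t, s_{t+1}, g') where the new goal
    g' = s_{t+k} is a state visited later in the same demonstration.

    Assumption: k is drawn uniformly from 1..T-t (illustrative choice).
    """
    assert len(states) == len(actions) + 1
    horizon = len(actions)
    relabeled = []
    for t in range(horizon):
        k = rng.randint(1, horizon - t)  # inclusive bounds, so t + k <= horizon
        relabeled.append((states[t], actions[t], states[t + 1], states[t + k]))
    return relabeled

# Toy demonstration with integer states and symbolic actions.
demo_states = [0, 1, 2, 3]
demo_actions = ["right", "right", "right"]
transitions = relabel_expert_trajectory(demo_states, demo_actions, random.Random(0))
for tr in transitions:
    print(tr)
```

The sketch makes the question above explicit: each relabeled tuple keeps the expert's action a_t but pairs it with a goal the expert was not originally pursuing, so its validity rests on the expert's behaviour also being good for reaching intermediate states.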