NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6872
Title: Curriculum-guided Hindsight Experience Replay

Reviewer 1

The paper borrows tools from combinatorial optimization (i.e. the facility location problem) in order to select hindsight goals that simultaneously have high diversity and are close to the desired goals. As mentioned, the similarity metric used for the proximity term seems to require the domain knowledge that Euclidean distance works well for this task. This may be problematic if there are obstacles that mislead the Euclidean distance, or in another environment where it is less obvious what the similarity metric should be. I am aware that this dense similarity metric is only used for selecting hindsight goals, and that the underlying Q function/policy is still trained on the sparse reward (without the bias). (I sketch my reading of this selection step at the end of this review.)

There are several related works that could be discussed and potentially benchmarked against in terms of hindsight goal sampling schemes:
- Sampling from the ground-truth goal distribution half the time for relabeling, and using the "future" strategy the other half (in their Appendix): A. Nair et al. Visual Reinforcement Learning with Imagined Goals. NIPS 2018.
- A heuristic goal sampling scheme: D. Warde-Farley et al. Unsupervised Control Through Non-Parametric Discriminative Rewards. ICLR 2019.

The paper supports its claims on Goal and Curiosity Driven Curriculum (GCC) learning with qualitative plots of the selected hindsight goals over the course of training. The plots seem to indicate that the hindsight goals from earlier episodes indeed have higher diversity, while those from later episodes are closer to the desired goals. The ablation studies on the lambda_0 value indicate that having both the diversity and proximity terms can affect performance. To show that the lambda curriculum is necessary, I think it would also be helpful to compare different *fixed* values of lambda (i.e. no curriculum) against the lambda curriculum.

The paper is fairly clearly written, with understandable high-level ideas. Some clarification details/suggestions:
- What similarity metric is used for the experiments, Equation 1 or 2 (or neither)?
- What are the \eta and \lambda_0 values? This would also tell us how large \lambda gets by the end of training. How sensitive is the performance to this parameter?
- Figure 4 gives a nice qualitative view of the selected achieved goals in relation to the desired goals and the full set of achieved goals. A quantitative view would also be valuable, e.g. plotting the values of F_prox(A) and F_div(A) over the course of training.
- The large performance gain on the hand manipulate pen rotate task deserves some attention, as previous approaches have so far not been able to make much improvement on it.

While the method seems more of a heuristic, I think the proposed approach will benefit the goal-conditioned RL community.

*** Post Author Rebuttal Comments ***
Thank you to the authors for their response. I am fairly satisfied with it:
- Given the additional ablations in the rebuttal that I specifically asked for (i.e. the fixed-\lambda comparison, sensitivity to the \eta parameter, and the F_div/F_prox curves), they have demonstrated the importance of having the curriculum to balance proximity and diversity (exploit vs. explore).
- On using Euclidean distance for proximity, the authors also addressed this reasonably in their rebuttal, especially by emphasizing that the main contribution is the balance between proximity and diversity, *given* a metric. I still encourage them to include a discussion of choosing the right metric for the domain in their final draft. There is unfortunately probably not enough space in the publication format for a thorough investigation with metrics / domain spaces other than L2, so I am OK with leaving that to future work.
- On related works: as they pointed out, the works I mentioned are not directly comparable (i.e. they use image inputs instead of object-position state representations), but those works also touch on balancing some sense of diversity in the hindsight goals (i.e. via sampling from a prior) versus using seen states. So I did not expect them to directly compare by applying CHER to domains with image inputs.
Overall I increased my score from my original review.
*************
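For concreteness, here is a rough sketch of how I read the hindsight-goal selection step: a greedy maximization of a facility-location diversity term plus a proximity term weighted by a curriculum coefficient. This is my own illustrative Python, assuming Euclidean distance and an RBF similarity; the function name, arguments, and exact weighting are placeholders, not the authors' implementation.

```python
import numpy as np

def select_hindsight_goals(achieved, desired, k, lam, sigma=1.0):
    """Greedily pick k achieved goals, trading off diversity within the selected
    subset (facility-location term) against proximity to the desired goals
    (weighted by the curriculum coefficient lam). Illustrative sketch only.

    achieved: (n, d) array of achieved goals from the sampled transitions
    desired:  (m, d) array of desired goals in the current batch
    """
    n = achieved.shape[0]
    # RBF similarity between achieved goals, used by the facility-location term.
    sq_dists = np.sum((achieved[:, None, :] - achieved[None, :, :]) ** 2, axis=-1)
    sim = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Proximity of each achieved goal to its nearest desired goal (negated distance).
    prox = -np.linalg.norm(achieved[:, None, :] - desired[None, :, :], axis=-1).min(axis=1)

    selected = []
    covered = np.zeros(n)  # covered[j] = max similarity of j to the selected set so far
    for _ in range(k):
        # Marginal gain of each candidate: facility-location gain + lam * proximity.
        div_gain = np.maximum(sim - covered[None, :], 0.0).sum(axis=1)
        score = div_gain + lam * prox
        if selected:
            score[selected] = -np.inf  # do not re-select
        best = int(np.argmax(score))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```

With lam growing over training, the selection shifts from diverse to goal-directed, which is how I understand the intended curriculum.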

Reviewer 2

Summary: The paper is based on the observation that experience sampling in HER is inefficient because certain pseudo-goals are irrelevant to the actual goal. It proposes to train on a subset of experiences that maximizes both proximity to the actual goal and diversity/representativeness (described by the facility location function). In addition, it scales up the proximity coefficient exponentially throughout training, so that HER converges more quickly. The final algorithm, termed CHER, outperforms the baseline DDPG significantly on 2 out of 4 hand-manipulation tasks.

Strengths:
1. Novel idea of balancing exploitation (near-actual-goal sampling) and exploration (diverse goal sampling) for HER.
2. CHER clearly attains better performance than HER on the proposed tasks.
3. The paper is well written and easy to follow.

Weaknesses:
1. There is no discussion of the choice of "proximity" and the nature of the task. On the proposed tasks, proximity in the fingertip Cartesian positions is strongly correlated with proximity in the solution space. However, this relationship does not hold for all tasks. For example, in a complicated maze, two positions that are nearby under the Euclidean metric can be very far apart along the actual path; similar issues apply to robotic tasks with various obstacles and collisions. The paper would be stronger if it analyzed which tasks have reasonable proximity metrics and demonstrated failure on those that do not.
2. Some ablation studies are missing, which could cause confusion and extra experimentation for practitioners. For example, the \sigma in the RBF kernel seems to play a crucial role, but no analysis is given of it. Figure 4 analyzes how changing \lambda affects performance, but it would be nice to see how \eta and \tau in equation (7) affect performance as well. (A toy sketch of these knobs is given after this review.)

Minor comments:
1. The diversity term, defined as the facility location function, is undirected and history-invariant. Thus it should not be called "curiosity", since curiosity only applies to novel experiences. Please use a different name.
2. The curves in Figure 3(a) are suspiciously cut off at Epoch = 50, after which the baseline methods seem to catch up and perhaps surpass CHER. Perhaps this should be explained.
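To be explicit about the hyper-parameters asked for above, here is a toy sketch of an exponentially growing proximity coefficient and of the RBF similarity whose bandwidth \sigma should be ablated. The functional form, names, and default values are illustrative guesses, not the paper's Equation (7).

```python
import numpy as np

def proximity_coefficient(epoch, lam0=1.0, eta=1.05, tau=1.0):
    """Toy exponential curriculum on the proximity weight: lam0 * eta**(epoch / tau).
    A stand-in for the paper's Eq. (7); the actual schedule may differ."""
    return lam0 * eta ** (epoch / tau)

def rbf_similarity(x, y, sigma=1.0):
    """RBF kernel whose bandwidth sigma the review asks to be analyzed."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))
```

Even a small sweep over eta, tau, and sigma in this spirit would address the missing ablations.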

Reviewer 3

The empirical results were difficult to interpret. Some of this difficulty is due to ambiguity in the results themselves, but the rest could have been addressed in the discussion. The lack of consistency in results across tasks, coupled with a weak discussion of these experiments, is concerning. However, the fact that the lambda=1 setting consistently outperforms the baselines and ablations suggests that the general technique is worth trying as a drop-in replacement for uniform sampling in experience-replay UVFA settings.

The analysis includes a proof bounding the suboptimality of the sampling strategy (which is nice), but it would have been helpful to include an empirical evaluation of what effect suboptimality in sampling actually has on agent performance. Is performance highly dependent on getting the sampling just right, or is anything that makes sampling more greedy sufficient? Relatedly, it would have been interesting to compare with simpler prioritized-replay mechanisms, e.g. directly using the goal-similarity metric in a priority queue, particularly since this is easy to implement (see the sketch at the end of this review).

Regarding "A diverse subset A of achieved goals encourage the agent to explore new and unseen areas of the environment and learn to reach different goals" (line 146): in a HER/UVFA setting it seems that the online choice of goal is the biggest factor in determining exploration, versus what is backed up offline. I would expect that in some cases sampling diverse goals could actually decrease exploration for a given goal by removing delusions in the value function. Overall I think the connection between prioritized sampling, which this paper focuses on, and the exploration-exploitation trade-off, which is typically viewed as an online choice or a reward augmentation, is tenuous and warrants further discussion.

In Fig. 3, CHER offers no benefit over the HER and HEREBP baselines on tasks b & c, but the improvement is significant on tasks a & d. To what do the authors attribute this difference? I also find it suspicious that the HER and HEREBP traces are nearly identical in all experiments, and even have similar dips in what is typically noise.

In Fig. 4, on tasks c & d, which are the harder and more interesting tasks, the effect of lambda is quite small, which suggests that the benefit in Fig. 3d over DDPG-HER is mainly an effect of something other than the balance of greedy vs. diverse sampling. Might this be explained by other parameters or tuning of the baseline implementations? This result seems particularly surprising since in Fig. 3a, which is a similar task, CHER and DDPG-HER had equivalent performance. To what can we attribute this inconsistency? On tasks a & b, lambda = 100 outperforms lambda = 0 and lambda = 10 outperforms lambda = 0.1, suggesting that lambda = 1 is optimal, but that being too greedy is better than being too diverse. This is an interesting result, but it does not seem to hold on c & d.

Plot titles and axis legends are difficult to read. What is the compute cost of the proposed method? Running an iterative optimization inside the batch-sampling step of a deep RL algorithm sounds expensive.
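To illustrate the simpler prioritized-replay baseline suggested above, here is a minimal sketch that ranks stored achieved goals by Euclidean distance to the desired goal and keeps the closest ones. The function and argument names are mine, assuming an L2 goal-similarity metric; this is not taken from the paper or its code.

```python
import heapq
import numpy as np

def sample_by_goal_similarity(achieved_goals, desired_goal, batch_size):
    """Return indices of the batch_size stored achieved goals closest to the
    desired goal. Illustrative priority-queue baseline sketch only."""
    dists = np.linalg.norm(np.asarray(achieved_goals) - np.asarray(desired_goal), axis=-1)
    # Heap-based top-k avoids a full sort of the replay buffer.
    return heapq.nsmallest(batch_size, range(len(dists)), key=lambda i: dists[i])
```

Comparing CHER against something this simple would show whether the diversity term, rather than goal proximity alone, is doing the work.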