Summary and Contributions: This paper proposed a new hierarchical RL method with two-level controllers(a high-level controller and a low-level controller). They build a Sparse Graphical Memory(SGM) to obtain an abstract(smaller) MDP from the original MDP, which reduces the difficulty of the planning problem. The high-level controller plans on the SGM and produces sub-goals(waypoints) for the low-level controller. The main contribution is that they propose a novel node-merging method called two-way consistency(TWC). The experiments show that SGM with TWC and memory cleanup can increase success rates of navigation tasks. I have read the reponse, and I will keep the same score on this paper.
Strengths: The method(SGM+TWC) proposed in this paper is reasonable and the author also gives out the proof for their theorem. The experimental results are good and the ablation study shows some of the properties of their method.
Weaknesses: 1.It seems that they only evaluate their algorithm on a single map of ViZDoom and SafetyGym. Lacking of abundant experimental evaluations makes their method less convincing. 2.When memory graphs or tasks gets more complex, it's impossible to use an optimal search method to plan on the graph. The proposed algorithm may have its limitation when extended to more challenging tasks.
Correctness: The claims and method are correct. The empirical methodology is correct.
Clarity: The paper is well written and easy to understand.
Relation to Prior Work: Yes.
Additional Feedback: 1. Evaluate on more maps. 2. Incroporate more powerful high-level planners or construct a multi-level(instead of two-level) control paradigm.
Summary and Contributions: The paper proposes to solve long horizon goal-conditioned visual-navigation tasks by constructing a sparse graph of observations from the replay buffer. The key contribution is a two-way consistency criterion to aggregate similar nodes in the graph. The resulting sparse graph is shown to be more reliable and amenable to correction at execution time.
Strengths: The paper studies the important approach of combining existing reinforcement learning approaches with more classical planning elements so as to solve long-horizon problems with variable goals. The basic idea of abstracting a sparse graph from the observations seems sound. The paper attaches code to aid reproducibility.
Weaknesses: The main novelty of the paper is an approach to aggregate related observations together to construct a sparse topological map; many previous works have also explored this idea, such as [1, 2, 3], which are not addressed in the paper. The paper makes no distinction between state and observations. The fact that high-level actions are also observations makes the presentation confusing. Furthermore, this limits the class of tasks that can be addressed. The main theorem in the paper make strong assumptions on the Q function (see correctness) The comparison to SPTM is unclear. The results reported in the original SPTM paper are much better than the ones reported here. Is it because of the subsampled observations? If so, why is this a valid comparison? Also, in SPTM, some care was taken to ensure that the approach generalized to unseen environments; there was a training environment, some validation environments and test environments. In this paper, there does not appear to be any attempt to do this. The appendix describes 3 different perceptual models used for perceptual consistency as three different lower level policies trained for each separate environment. The importance of this perceptual similarity is not addressed in the body of the paper. The Q function is typically smaller than models used for perceptual embeddings, so it seems unlikely that its faster computationally to compute a perceptual embedding. Presumably directly using Q function similarity leads to poor performance and so this environment specific embedding was required. There are a substantial number of hyperparameters tuned for each environment, which raises questions as to the generality of the approach. Not all results have confidence intervals included. Details of seeds for different runs are not described.  Neural Topological SLAM for Visual Navigation, Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, Saurabh Gupta, CVPR 2020  A Behavioral Approach to Visual Navigation with Graph Localization Networks Kevin Chen, Juan Pablo de Vicente, Gabriel Sepulveda, Fei Xia, Alvaro Soto, Marynel Vazquez, Silvio Savarese, RSS 2019  Scaling Local Control to Large-Scale Topological Navigation Xiangyun Meng, Nathan Ratliff, Yu Xiang, Dieter Fox, ICRA 2019
Correctness: The central theorem of the paper, Theorem 1, appears to be correct, but does not quite validate that “goals can be reached with almost the same number of steps in both the aggregated and dense graphs when TWC is used, i.e. the resulting plans are near-optimal” [L71]. The theorem shows that plans are near optimal if we have an accurate Q function – in which case you can already accomplish the task perfectly. Instead, a theorem showing optimality bounds of the proposed approach, given an existing optimality bound on Q would be more useful.
Clarity: The paper is generally easy to read but too many of the details, in particular, with respect to perceptual similarity are left to the appendix. In general, I was confused as to precisely how the testing was carried out.
Relation to Prior Work: The paper is missing comparisons with a number of recent papers exploring topological memory (see weaknesses). The paper does not discuss differences between the proposed approach with the listed papers.
Additional Feedback: -- Post Rebuttal -- Thank you for the clarifications.
Summary and Contributions: This paper attempts to combine the strengths of deep learning and classical planning methods in a scalable manner, to solve long-horizon visual navigation tasks. The authors introduce Sparse Graphical Memory (SGM), an algorithm to sparsify a graph consisting of observations of an agent as nodes and the transitions as edges. The key idea in the algorithm is that two nodes/observations are aggregated if they are both interchangeable as goals or starting states. The authors demonstrate the effectiveness of SGM on three different environments, to outperform two baseline methods.
Strengths: Previous works such as SoRB and SPTM build a graph of replay buffer with observations as nodes and transitions as edges and use graph-based planning to find waypoints between the start and goal states and low-level neural controllers to traverse between waypoints. This approach is not scalable to large environments as the graphs grow quadratically. This paper introduces an algorithm to prune the nodes in this graph to ensure scalability and simple heuristics to cleanup erroneous transitions. The pruning algorithm is derived by extending the notion of value irrelevance to goal-conditioned RL: two nodes are redundant if they are both interchangeable as goals and start states, according to the goal-conditioned value function. The authors perform ablation studies to demonstrate that the improved performance is due to the effectiveness of the proposed method.
Weaknesses: - Although it is remarked that long-horizon visual navigation tasks are challenging for MPC, policy optimization and Q-learning algorithms, SGM is not compared with such methods on these tasks. For example, Hindsight Experience Replay. - Completely different network architectures and hyperparameters are used for each task.
Correctness: The empirical methodology is same as that of previous works. Use of different settings for each benchmark task is concerning.
Clarity: The paper is well-written and easy to read.
Relation to Prior Work: Relation to prior work is clearly discussed in Section 2.
Additional Feedback: - SPTM originally uses human demonstrations to bootstrap the graph. How does that influence the results presented in this paper? - The paper does not work with some PDF readers and page 13 in the Appendix is missing. Update after author response: I acknowledge that I have read the rebuttal.