Paper ID: 9003

Title: Learning Reward Machines for Partially Observable Reinforcement Learning

The authors propose a novel approach for solving POMDPs by simultaneously learning and solving reward machines. The method relies on building a finite state machine that correctly predicts the possible observations and rewards. The authors demonstrate that their method outperforms baselines in three different partially observable gridworlds. Overall, I found the paper clear and well motivated. Learning to solve POMDPs is a very challenging problem, and any progress or insight has the potential to have a big impact.

My main concern about this work is the scalability of the discrete optimization problem. The experimental results don't provide much insight in that regard. How long does solving the LRM problem take in these gridworlds? How often are the RMs recomputed in practice? While better learning curves do confirm that the algorithm works, they don't usually provide much insight into how the algorithm behaves. Additionally, without more information about the domain, it is difficult to appreciate the performance difference. For instance, without knowing how many steps are required to traverse the rooms, I can't tell whether the 10 steps of history given to DDQN is too much, enough, or too little. I would expect that a sufficiently long history would enable DDQN to solve the task. Have the authors tried this?

While I am disappointed by the empirical evaluation, the conceptual and theoretical contributions have me leaning towards acceptance. There is a considerable amount of work that focuses on improving discrete and hybrid optimization, and being able to make use of it is very appealing. I found the decision to replace the PSR's optimization criterion (i.e., Markovian predictions) with the more limited constraints on possible observations quite interesting. Do the authors have any intuition about when this is not a good proxy? When would this lead to less desirable RMs? How does this work relate to [1], which also follows the PSR approach?
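To make the discussion above concrete: the object the paper learns is, roughly, a finite state machine over high-level events whose transitions emit rewards. A minimal sketch of such a reward machine follows; the toy key/door task and all names are my own invention for illustration, not taken from the paper.

```python
# Minimal sketch of a reward machine (RM) as a finite state machine.
# The class, the toy task (find a key, then open a door), and all names
# are hypothetical illustrations, not the paper's implementation.

class RewardMachine:
    def __init__(self, transitions, rewards, initial_state=0):
        self.u = initial_state          # current RM state
        self.transitions = transitions  # (state, event) -> next state
        self.rewards = rewards          # (state, event) -> reward

    def step(self, event):
        """Advance the RM on a high-level event (produced by the labeling
        function) and return the reward attached to that transition."""
        key = (self.u, event)
        reward = self.rewards.get(key, 0.0)
        self.u = self.transitions.get(key, self.u)  # stay put if no edge
        return reward

# Toy RM: state 0 = "looking for key", state 1 = "have key".
rm = RewardMachine(
    transitions={(0, "key"): 1, (1, "door"): 0},
    rewards={(1, "door"): 1.0},
)
assert rm.step("door") == 0.0  # no reward before picking up the key
assert rm.step("key") == 0.0   # key found, RM moves to state 1
assert rm.step("door") == 1.0  # reward for opening the door with the key
```

The scalability question above concerns searching over such machines (states plus transition/reward maps) as a discrete optimization problem, which grows quickly with the number of RM states and events.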
Additionally, how does this scale when there are a large number of possible observations for some or all reward machine states?

Misc comments:
- Line 81, typo: "[...] only see what _it_ is in the room that [...]"
- Experiments: how many steps does it take to go from one room to another?
- Lines 294-295, "the maximum reward": is it meant to say the maximum cumulative reward/return?
- Why not evaluate each method the same way? This seems unnecessary and a bit odd.

[1] Zhang, Amy, et al. "Learning Causal State Representations of Partially Observable Environments." arXiv preprint arXiv:1906.10437 (2019).

Post-rebuttal:
-----------------------
My concerns have been addressed, and I recommend acceptance.

# Originality
The work presents a well-thought-out extension of RMs to POMDPs, and a learning algorithm to jointly learn them with state-of-the-art model-free methods. The manuscript contains a sufficient literature review and makes the distinctions and improvements over previous work clear.

# Quality
The overall quality of the work is high. The presented framework and learning method seem theoretically sound (at least under quick scrutiny), and the experimental results are sound, albeit small in scope and, possibly, analysis. That said, the authors are careful about their claims and seem quite forthcoming regarding the weaknesses of LRM.

# Clarity
The paper is very well written and mostly clear with respect to its notation and presentation. The manuscript and its supplementary materials are dense with both theoretical and implementation details. I feel relatively confident that I and most other RL researchers could make a good attempt at reproducing the method and the experimental settings purely based on the writeup; nonetheless, I hope that the authors will release the code as declared in the manuscript.

# Significance
Reward Machines have yet to become part of the common set of tools used by RL researchers and practitioners, primarily because of how recent they are and the relatively simple settings that have been used to evaluate them so far. This work doesn't fix that particular issue; however, it demonstrates that for a class of common environment settings (i.e., partially observable, sparse-reward), they have the potential to be a good way to incorporate priors about the task structure into policy learning and inference, which is an extremely important topic of research in the RL community.

The authors apply Reward Machines (RMs) [26] to POMDP problems. While an RM is an automaton-based representation of the reward function, originally proposed for decomposing problems into subproblems, here it is used as a kind of belief-state representation. While RMs were given by hand in previous work, this paper proposes a method for learning RMs from data. The task of learning RMs is formulated as a discrete optimization problem. The authors also show the effectiveness of the proposed method with numerical experiments. The proposed method had much better performance than A3C, ACER, and PPO with LSTM. However, I have the following concerns.

Concerns:
[A] It appears to be non-trivial to prepare the labeling function L. It would be useful to give examples of the labeling function in several POMDP tasks.
[B] The numerical experiments are weak. Experiments on various POMDP benchmark tasks or practical tasks would be required. The baseline methods also seem to be biased: the proposed method can be regarded as a kind of model-based approach for POMDPs, so a model-based method for POMDPs such as [*] should be included among the baselines.
[C] I could not follow the proof sketch of Theorem 4.3, especially the last sentence. I feel that there is some discrepancy between Definition 4.1 for the perfect RM and the setting of the objective value in (LRM), which is the sum over log(|N_*|).

Minor issue:
* On page 13, should Theorem 3.* be Theorem 4.*?

[*] Doshi-Velez, Finale P.; Pfau, David; Wood, Frank; Roy, Nicholas. "Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning." IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 37, Issue 2, 2015.
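To illustrate what concern [A] is asking for: a labeling function L maps a raw observation to the set of high-level propositional events that an RM transitions on. The sketch below is a hypothetical example of such a function; the observation format and event names are invented for illustration and do not come from the paper.

```python
# Hypothetical labeling function L: raw observation -> set of high-level
# events. The dict-based observation and event names are invented here;
# in practice L must be hand-engineered per task, which is the concern.

def labeling_function(observation):
    """Return the set of propositions true in the current observation."""
    events = set()
    if observation.get("at_key"):
        events.add("key")
    if observation.get("at_door"):
        events.add("door")
    if observation.get("cold"):
        events.add("cold")
    return events

assert labeling_function({"at_key": True}) == {"key"}
assert labeling_function({"at_door": True, "cold": True}) == {"door", "cold"}
assert labeling_function({}) == set()
```

Even in this toy form, choosing which propositions to expose (and detecting them reliably from partial observations) requires task knowledge, which is why worked examples across several POMDP tasks would strengthen the paper.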