| Paper ID | 824 |
|---|---|
| Title | Multi-View Reinforcement Learning |

Originality: This work is innovative in generalizing the Markov decision process to multi-view scenarios. The authors have clearly distinguished their work from the state of the art and similar works.

Quality: The submission is technically correct. The claim that multi-view RL using policy transfer between views converges faster has been adequately demonstrated. The paper reads like a complete piece of work, and the authors have performed thorough experiments to support their claims. They also discuss the case where the proposed solution did not outperform the competing method and justify why that would be the case.

Clarity: The paper skips over many derivations, and only a reader who is very familiar with MVRL would be able to understand it without the supplementary document. Otherwise, the paper is well written. However, without the supplementary material, it would be hard to replicate the experiments.

Significance: The work seems relevant and significant to advancing the state of the art in reinforcement learning. There is an increasing number of scenarios where multiple sensors sense the same environment. Therefore, multi-view RL is definitely going to be a hot topic of research.

This work proposes a simple but efficient extension of POMDPs to multi-view data by extending the observation/state space over the multiple views and modeling the long-term dependencies between the (latent) states. To deal with multi-view latent variables (states), a variational lower bound is derived and optimized in two settings: model-based and model-free (using RNNs). Optimizing the derived objective is not trivial, and the authors propose various strategies to deal effectively with the dimensionality of this problem (including the policy transfer and few-shot learning ideas). I find this work novel and very well presented. The paper is well written and easy to follow. The supplementary material describes the algorithm in detail, making the model easier to understand. The state of the art and related approaches are well reviewed and compared with the proposed approach. The results on different RL environments show that the proposed model can effectively learn the underlying shared dynamics (owing to the multi-view modeling), in contrast to the world model.

Originality: Considering that multi-view and multi-modal RL papers tend to offer ad-hoc solutions to the problem, this paper's formalization is a nice contribution.

Quality & Clarity: The paper is well presented and the math is mostly clear, although some parts aren't obviously translatable to an implementation. In terms of experiments, many details seem to be lacking, and as far as I can tell, all figures represent a single run of each setting, which is worrisome.

Significance: While the contributed framework does seem like a useful formalism, this paper fails to convince me that it actually is:
- The proposed experiment in 4.1 creates artificial views which don't seem representative of multimodal settings, in that they all contain the same *information*. It would have been more convincing to feature an experiment where views are truly independent when conditioned on the current state (e.g. dialog and facial expression, partial views).
- It's not clear that the advantage comes from your formulation rather than just more things being learned (i.e., what you propose reduces sample complexity because it is an auxiliary task). You should have experiments confirming this.
- Again, each experiment setting seems to only have a single run. All your results could be plain luck.

Additional comments:
- l49, "agents *to* reason"
- Section 2.1: if I understand correctly, you force the agent to receive information o_t^{i_t} about only one view per timestep. What is the distribution of i_t? Is it dependent on state and action? Is it a choice of the agent? This should be clear in your framework.
- l94, "existing *advancements* on"
- l105, "thus being optimal than independent": what do you mean? "as optimal as"? "more optimal than"?
- Figure 2: why are the X axes of different lengths? How did you choose when to stop training? This should be reported.
- Figure 4: I'm not sure I see the interest of having the X axis in log scale.
- Table 1: why the "\sim 360"? It seems like you should be reporting mean and variance, like "360 \pm 10".
- l245: what view did you transfer? When? We need more details.
- l269, "primarily *encourages* higher"
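
As an illustration of the reporting suggested for Table 1, here is a minimal sketch of how per-seed results could be summarized as a mean with a spread. It assumes each setting is run with several independent seeds and that one final return is recorded per run. The function names and the example returns are hypothetical placeholders and are not taken from the paper.

```python
import statistics


def summarize_runs(final_returns):
    """Return (mean, sample standard deviation) over independent runs."""
    mean = statistics.mean(final_returns)
    # The sample standard deviation needs at least two runs; with a single
    # run there is no spread to report, which is the concern raised above.
    std = statistics.stdev(final_returns) if len(final_returns) > 1 else 0.0
    return mean, std


def format_result(final_returns):
    """Format a table entry in the 'mean \\pm std (n=runs)' style."""
    mean, std = summarize_runs(final_returns)
    return f"{mean:.0f} \\pm {std:.0f} (n={len(final_returns)})"


# Hypothetical final returns from five seeds (illustrative only).
example_returns = [350.0, 365.0, 358.0, 372.0, 355.0]
print(format_result(example_returns))
```

Reporting the number of runs alongside the spread lets a reader judge whether a gap between two methods is larger than seed-to-seed variation.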