NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 3186
Title:A Regularized Approach to Sparse Optimal Policy in Reinforcement Learning

Reviewer 1

Originality: There are many papers studying regularized RL problems (cited in this paper). Regularization can introduce many good properties to the solution of the MDPs. This paper proposes a generalized framework that includes most of the previously studied regularization-based methods. This framework can lead to an effective understanding of which regularizers are sparsity-generating. To my knowledge, this paper contains sufficient originality.

Quality: I did not check the proofs line by line, but they appear to be correct. The experimental results also make sense. The study is quite complete.

Clarity: This paper is well written and of publishable quality.

Significance: The paper gives a good study of the regularized framework for solving MDPs and provides some interesting understanding of the regularizers. I believe it is sufficiently significant to be accepted.

Minor comments: * What is meant by "multi-modality"?

Reviewer 2

-Originality: The characterization of the optimal policy and the bounds on the value functions are new. Related work is cited. Although some techniques are analogous to previous work (which is not bad per se, as it allows applying more general regularisers within previous frameworks such as soft actor-critic with only small changes), this work differs significantly from previous work and yields new insights into how to obtain sparse policies.

-Quality: The submission seems largely technically sound. Claims are supported by proofs, and experiments confirm that considering more flexible regularizations can be beneficial in different tasks. There are some issues with the continuous-state case; see the section on improvements for details. Further, the authors claim that trigonometric and exponential function families yield multimodal policies (line 287). However, it is not clear to me how this differs from, say, entropy regularisation, or why a softmax policy cannot have multiple modes (unless, of course, I parameterize the policy with a single Gaussian in the continuous case, but that is a different issue). Also, I am not sure I understand lines 123-124; they seem to me to contradict Proposition 1, namely that \mathcal{F} is closed under min. Lastly, I was also wondering how fast and scalable this approach of solving the convex optimization problem via CVXOPT is compared with the gradient-descent approaches used in entropy-regularised settings with function approximators.

-Clarity: The paper is relatively well written and organized. The convergence of a regularised policy iteration algorithm (Theorem 6) is given only in the appendix; it may be useful to mention this in the main text as well.

-Significance: It is important to have a better understanding of how different regularisers affect the sparsity of the optimal policy and the resulting performance errors. The contribution addresses these issues, and I feel that others can build upon and benefit from the ideas presented therein.
POST AUTHOR RESPONSE: I thank the authors for their feedback. Having read it along with the other reviews, I increase my score from 6 to 7. The rebuttal has largely addressed my concerns in the continuous-state case. The missing log-determinant term in the density appears to be merely a typo and has apparently been included in the experiments. I am not convinced that the chosen parameterisation of the density is a particularly good choice for obtaining sparse policies. However, it is a commonly used transformation (as in entropy-regularised approaches), so I think this choice is acceptable, especially since the continuous-action case is not the main focus of the paper. The response also commented on the scalability of the proposed optimization approach versus the stochastic gradient descent commonly used in related work. Overall, the paper is interesting and can be accepted.

Reviewer 3

This paper is incredibly thorough and represents a clear advance in our understanding of regularization in MDPs. (See the contributions listed above.)