Summary and Contributions: The paper studies policy gradient methods for optimizing concave functions of the state occupancy measure. It gives a simple, clear argument for why such a method reaches a global optimum when the parameterized policy class contains all policies. The paper also provides a formula for the gradient in terms of value-function space. A couple of numerical experiments are conducted in a simple environment.
Strengths: Theorem 3 is simple (which is good!) and clear. There is a lot of conceptual value to the community in this perspective and in the observation that it allows one to optimize more general objectives. The gradient estimation procedure in Theorem 1 also took creativity.
Weaknesses: Theorem 3, while creative, has some major weaknesses. It holds in tabular cases, where every possible policy is in the policy class, but it is not clear whether there are any other meaningful examples. (Certainly, it requires that the policy class be convex, but since the mapping \pi \mapsto \lambda(\pi) is nonconvex, it requires much more than that.) Assumption 1 also rules out softmax policies: it effectively requires that the Jacobian matrix have eigenvalues bounded above and below by constants, and that \Theta be closed. These conditions fail for softmax policies (the derivatives can vanish, and the parameter space is unbounded). The authors should emphasize this. I also don't see how Theorem 1 would be applied in settings with an extremely large state space.
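To make the softmax point concrete, here is a quick numerical sketch of my own (not taken from the paper): for a tabular softmax policy over a single state, the Jacobian of the map \theta \mapsto \pi_\theta is diag(\pi) - \pi\pi^T. It always has a zero singular value along the all-ones direction (shifting every logit equally changes nothing), and its remaining singular values also shrink toward zero as the policy concentrates, so no uniform lower bound of the kind Assumption 1 needs can hold over an unbounded \Theta.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def softmax_jacobian(theta):
    # d pi_a / d theta_b = pi_a * (1[a == b] - pi_b)
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

# As the logits are scaled up, the policy becomes near-deterministic and
# all singular values of the Jacobian collapse toward zero.
for scale in [0.0, 2.0, 5.0, 10.0]:
    theta = scale * np.array([1.0, 0.0, 0.0])
    svals = np.linalg.svd(softmax_jacobian(theta), compute_uv=False)
    print(f"scale={scale:5.1f}  singular values={np.round(svals, 6)}")
```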
Correctness: The claims look correct, but other than the proof of Theorem 1, I did not read the proofs in the appendix.
Clarity: The writing is fine.
Relation to Prior Work: The paper should highlight that a "dual" or saddle-point perspective on the classical policy gradient theorem is not new; see, for example, the introductory paper "Reinforcement Learning via Fenchel-Rockafellar Duality." Also, [18] seems to treat general problems with concave objectives, not just maximum entropy exploration. The authors should describe why they feel their approach offers advantages.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The paper addresses the problem of extending the cumulative-rewards objective to more general utilities, such as any concave function of the state-action occupancies. In particular, it presents a Variational Policy Gradient Theorem for the general setting, which casts the policy gradient as the solution of a stochastic saddle-point problem. It further provides (1) a principled method to estimate the newly introduced policy gradient, (2) a convergence analysis of the corresponding policy optimization process, and (3) a brief numerical validation.
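For intuition, here is a rough sketch in my own notation (not the paper's exact statement) of why the gradient of a concave utility reduces to an ordinary policy gradient with a "pseudo-reward" chosen by an inner optimization, which is where the saddle-point structure comes from:

```latex
% Rough sketch (reviewer's notation): write a closed, concave utility R of the
% occupancy measure lambda(theta) through its concave Fenchel conjugate, then
% differentiate with an envelope / Danskin-type argument.
\[
  R(\lambda(\theta)) \;=\; \min_{z}\,\bigl\{\langle z,\lambda(\theta)\rangle - \bar R(z)\bigr\},
  \qquad
  \bar R(z) \;:=\; \min_{\lambda}\,\bigl\{\langle z,\lambda\rangle - R(\lambda)\bigr\},
\]
\[
  \nabla_{\theta} R(\lambda(\theta))
  \;=\; \Bigl(\tfrac{\partial \lambda(\theta)}{\partial \theta}\Bigr)^{\!\top} z^{*}(\theta),
  \qquad
  z^{*}(\theta) \in \arg\min_{z}\,\bigl\{\langle z,\lambda(\theta)\rangle - \bar R(z)\bigr\}.
\]
% That is, the gradient is an ordinary policy gradient evaluated with the pseudo-reward
% z^*, so once lambda is estimated from sampled paths the gradient is naturally
% characterized as the solution of a stochastic saddle-point problem.
```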
Strengths: I found the idea of providing a theoretical understanding of RL beyond reward maximization extremely relevant; it also provides a compelling unifying perspective for a large body of work in the still-growing threads of risk-averse RL, pure exploration, and imitation learning. The formulation of the Variational Policy Gradient is really nice, even if it might prove too complicated for practical policy optimization. The theoretical results are impressive, featuring an error rate that matches that of traditional policy gradients and a convergence rate of O(1/t), which also improves the best known rate in the special cumulative-rewards case (tabular policies) and upgrades to exponential convergence in the relevant KL-regularized setting.
Weaknesses: My evaluation of the work is mostly positive. I detail below some additional comments and questions that the authors might address in their response.
Correctness: I did not check the proofs in detail, but the methodology seems sound. The empirical methodology also seems OK, but the results could be reported more clearly (see point 5 below).
Clarity: Yes
Relation to Prior Work: Partly, see points 6 and 7 below
Reproducibility: Yes
Additional Feedback: Update: thanks for the answer, it helped clarify some points. I think the proposed additions will improve the clarity of the paper.

----

1. While providing a common theoretical ground for general utilities in RL is not a minor contribution by any means, I would have loved to find a discussion on how to build upon these results. Do the authors think their work can be leveraged to develop more efficient algorithms in the context of RL with general utilities, or is the intended outcome a deeper understanding of the setting without particular practical upsides?

2. Where does the Variational Policy Gradient approach stand in comparison with other policy optimization methods for (specific) general utilities, e.g., CVaR policy optimization [e.g., Chow and Ghavamzadeh, Algorithms for CVaR optimization in MDPs, 2014] or MaxEnt policy optimization [Hazan et al., 2019]? Does the additional generality impose any cost on the efficiency of the optimization?

3. Can you comment on the interpretation of the reward term z, which is a quite interesting by-product of the Variational Policy Gradient estimation? What happens if we fully optimize a policy w.r.t. this reward function? How would that compare with the MaxEnt algorithm, which casts policy optimization as a sequence of cumulative-rewards problems? (See the short sketch after this list.)

4. The reported results are restricted to stationary Markovian policies. This is a common choice for the cumulative-rewards objective, since it is well known that this policy space suffices. Is that also the case for general utilities?

5. I found Figure 1 quite confusing. It is not clear to me whether the curves are a function of the number of samples, episodes, or iterations. While the estimate for any considered utility converges to the true gradient, estimating the gradient for the entropy objective seems quite inefficient in practice. Oddly enough, apart from the gradient of cumulative rewards, all the estimates point in the opposite direction w.r.t. the true gradient at first.

6. In a sense, this work provides a "concretization" of the very abstract framework of [1]. I think the concept of "hidden convexity" is already there, and it is nice to see it finally exploited. You should at least mention [1].

7. I would like to see a discussion of the "O-REPS" [2] family of algorithms. I am not too familiar with those, but from my understanding they optimize in the space of occupancy measures directly. The main critique (see, e.g., [3]) is that they are not practical, since one cannot obtain the optimal policy parameters for general parametric policies. Your method seems an improvement in that sense.

8. Typo at line 456: "we have we omit" -> "we omit".

9. How can the gamma and delta terms be avoided when computing the gradients of z and x in equations (18, 19)?

[1] Casey Chu, Jose H. Blanchet, Peter W. Glynn. Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning. ICML 2019: 1213-1222.
[2] Alexander Zimin and Gergely Neu. Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search. In Advances in Neural Information Processing Systems, pp. 1583-1591, 2013.
[3] Yonathan Efroni, Lior Shani, Aviv Rosenberg, Shie Mannor. Optimistic Policy Optimization with Bandit Feedback. CoRR abs/2002.08243 (2020).
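Regarding point 3, here is a toy sketch of my own (not the paper's algorithm) of the alternation I have in mind: at each step, linearize the concave utility at the current occupancy to obtain the pseudo-reward z = grad R(d), then take a policy-improvement step against that frozen z. Fully optimizing against a frozen z instead would correspond to one Frank-Wolfe-style outer iteration, as in MaxEnt-type schemes. The environment, step size, and the exponentiated update below are all illustrative choices.

```python
import numpy as np

# Toy tabular illustration (reviewer's own construction): entropy of the discounted
# state occupancy as the concave utility R(d) = -sum_s d(s) log d(s).
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
mu0 = np.ones(nS) / nS                          # initial state distribution

def occupancy(pi):
    """Normalized discounted state occupancy d_pi for a tabular policy pi[s, a]."""
    Ppi = np.einsum('sap,sa->sp', P, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * Ppi.T, mu0)

def entropy_grad(d):
    """Pseudo-reward z = grad R(d) for the entropy utility."""
    return -(np.log(d) + 1.0)

pi = np.ones((nS, nA)) / nA
for it in range(50):
    d = occupancy(pi)
    z = entropy_grad(d)                         # re-linearize: current pseudo-reward
    # One soft policy-improvement step against the frozen reward z.
    Ppi = np.einsum('sap,sa->sp', P, pi)
    v = np.linalg.solve(np.eye(nS) - gamma * Ppi, z)        # evaluate z under pi
    q = z[:, None] + gamma * np.einsum('sap,p->sa', P, v)   # state-action values
    pi = pi * np.exp(5.0 * q)                   # exponentiated-gradient style update
    pi /= pi.sum(axis=1, keepdims=True)

print("final occupancy:", occupancy(pi))        # mass spreads across states
```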
Summary and Contributions: The paper addresses a fundamental problem in which the objectives of RL are general utilities. The authors derive a new variational policy gradient and a variational Monte Carlo gradient estimation method based on sampled paths. Theoretical analysis of the method's properties and empirical experimental results are also given.
Strengths: This paper develops a new optimization algorithm for reinforcement learning with general utilities. The convergence rate of the new method is also given when the problem satisfies certain convexity conditions. An empirical comparison with policy gradient is given to demonstrate its benefits.
Weaknesses: The proposed variational policy gradient algorithm converges more slowly than RN-Reinforce, which may limit its applications in real settings.
Correctness: The logic is sound and partially supported by experiments.
Clarity: The paper is easy to follow and clearly written.
Relation to Prior Work: Related works are well addressed.
Reproducibility: Yes
Additional Feedback: