NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5585
Title: Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy

Reviewer 1

This paper studies proximal policy optimization (as well as TRPO) in the setting where the policy and the action-value function are parameterized by two-layer neural networks. Due to the overparameterization of the neural networks, convergence rates for the policy improvement and policy evaluation steps can be analyzed. The overall convergence of the PPO algorithm then leverages the global convergence of infinite-dimensional mirror descent.

Though I find that the paper provides a valid proof of convergence for the PPO algorithm, I do not find the proof techniques to be novel enough. It appears that the entire proof consists of putting together several pieces of techniques/proofs that already exist in prior work. More specifically, the proofs for policy improvement and policy evaluation are straightforward adaptations of existing proofs for supervised learning with overparameterized neural networks. The overall convergence adapts the existing proof for the mirror descent algorithm. The error propagation adapts the proof for neural-network Q-learning. Hence, the particular contribution of this paper is not very significant.

More specific comments are as follows.
- The proofs for policy improvement and policy evaluation require taking an average over the random initialization. However, the statements of the theorems (Theorems 4.5 and 4.6) do not include such an averaging. Please explain.
- Typically, PPO may not converge to the global optimum even under linear function approximation. Is the convergence here due to the sufficiently large function class (i.e., Assumption 4.3)?
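For reference, my reading of the policy-improvement step that the mirror-descent argument rests on is the following KL-regularized update (paraphrased in my own notation, which may differ from the paper's):
\[
\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s}\Big[ \big\langle Q^{\pi_k}(s,\cdot), \pi(\cdot \mid s) \big\rangle - \beta_k \, \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big) \Big],
\]
whose population solution is the exponentiated (mirror-descent) iterate
\[
\pi_{k+1}(\cdot \mid s) \;\propto\; \pi_k(\cdot \mid s) \exp\big( Q^{\pi_k}(s,\cdot) / \beta_k \big).
\]
If this reading is correct, the overall convergence indeed follows the standard mirror-descent template once the neural estimates of the action-value function and of the policy track these ideal iterates, which supports my view that the main steps adapt existing analyses.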

Reviewer 2

Originality: The authors apply the idea that overparametrization induces local linearization, which has been documented for supervised learning, and in another submission for TD learning. In particular, they decompose the error into two terms, one due to TD and the other due to SGD, and incorporate them into the analysis of infinite-dimensional mirror descent. The insight that the previous analysis for TD can be generalised to a meta-algorithm that includes both TD and SGD as particular cases is key. Related work is adequately cited, and differences with previous works are clearly stated, including differences with the sister submission [5].

Quality: The submission seems technically sound and includes detailed proofs (I just skimmed through them). This is a complete piece of work. The main limitation of the analysis is the assumption on the neural network architecture, which according to (3.1) seems to be a multi-layer perceptron with a single hidden layer (as opposed to deeper architectures with two or more hidden layers). No experimental results have been included.

Clarity: The paper is well organised and clearly written.

Significance: The submission addresses the convergence of PPO with a specific neural architecture, which is a very difficult task due to the nonconvexity of the objective, the parametrization of both the value function and the policy, and the stochastic updates. And they solve it with an intuitive approach. I believe the results are relevant and will likely inspire further results on the nonasymptotic analysis of other deep-RL methods, especially if the local linearization effect induced by overparametrization can be identified in other neural network architectures.

META REVIEW
============
The authors have addressed my comments. I appreciate their commitment to add simulation results to the paper. On the other hand, although I was very enthusiastic when reading this paper, I do not find the authors' response to Reviewer 1's concerns very satisfactory. In particular, the authors claim: "Our proof of the convergence of PPO/TRPO has several building blocks, which, to the best of our knowledge, is not covered by existing work", namely the "unified analysis of both TD and SGD", the "analysis of infinite-dimensional mirror descent under one-point monotonicity", and the "error propagation" of tracking a policy. It turns out that the first two points are very closely related to another submission, Paper 6047, 'Neural Temporal-Difference Learning Converges to Global Optima'. Indeed, the following sentences are copied from Paper 6047:
- "overparametrization allows us to establish a notion of one-point monotonicity [25, 19] for the semigradients followed by neural TD, which ensures its evolution towards the global optimum of MSPBE along the solution path. Such a notion of monotonicity enables us to circumvent the first and second obstacles of bias and nonconvexity."
- "In Section 4.2, we establish the nonasymptotic global rates of convergence of neural TD to the global optimum of the MSPBE when following the population semigradients in (3.3) and the stochastic semigradients in (3.1), respectively."
- "The key to the proof of Theorem 4.4 is the one-point monotonicity of the population semigradient g(t), which is established through the local linearization..."
Paper 6047 also covers soft Q-learning and explicitly says that "our global convergence of neural soft Q-learning extends to a variant of the actor-critic algorithm."
However, the relationship between that analysis and the error propagation of the current submission is not obvious. Therefore, even though I still think the current paper is very relevant, it makes me think that the key idea of generalising the previous neural-TD analysis is more incremental, so I reduce the score to 7.
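For completeness, the linearization mechanism I refer to above is, as I understand it (paraphrased; the exact scaling and symbols may differ from equation (3.1) of the paper), a width-$m$ two-layer network
\[
f(s,a; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \, \sigma\big( w_r^\top (s,a) \big),
\]
with the output weights $b_r \in \{-1,+1\}$ fixed after random initialization, together with its first-order expansion around the initialization $W_0$,
\[
\hat{f}(s,a; W) = f(s,a; W_0) + \big\langle \nabla_W f(s,a; W_0), \, W - W_0 \big\rangle,
\]
which is accurate when $m$ is large and the iterates stay in a neighborhood of $W_0$. This is the same mechanism exploited in Paper 6047, which is why I view the generalisation as somewhat incremental.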

Reviewer 3

The paper is a solid piece of theoretical work. The authors use overparameterized neural networks as Q-functions, which appears to be inspired by recent advances in this area. The convergence rate seems optimal for the problem, at least as far as I can tell. The paper, however, lacks an experimental section validating whether the convergence rate is tight, which is a pity. Alternatively, there could be some analysis complementing the experiments in existing papers, which would convey insight to the wider community. The overparameterization assumption does not seem practical in actual experiments, at least to my recollection, although I understand that closing this gap is probably not the job of an RL paper. Besides, the paper seems to have no conclusion section?

Some questions:
- At line 37, the authors claim to address the problems of infinite-dimensionality and nonconvexity. Where do these difficulties come from? Could the authors elaborate a little more on the technical difficulty for this specific function class?
- In Algorithm 1, what is the criterion for lines 5 and 6 to converge? Or is each just a single gradient descent step?