Paper ID: | 6209 |
---|---|

Title: | Policy Optimization Provably Converges to Nash Equilibria in Zero-Sum Linear Quadratic Games |

The paper presents nested gradient based approaches for zero sum LQ-games (where the system evolves linearly as a function of agents’ actions, and quadratic costs). The authors highlight several challenges associated when using policy optimization (PO) methods for such games such as nonconvexity wrt agents’ policy space which in-turn affects the stability of PO methods. To alleviate this, the authors present nested gradient based approaches which mitigate non-stationarity of learning in such games. The resulting approaches are also shown to have good convergence properties, they converge to the Nash equilibrium. The authors highlight that several value-based methods exist for solving zero-sum Markov games, similar to the problem setting addressed in the paper. However, empirical section does not seem to contain any comparisons against such value based methods. Such a comparison would highlight what is the benefit of the policy gradient methods developed in the work. The notion of projection (line 189, Eq4.7) needs better explanation. If the algorithm theoretically requires projection, then one can design instances when not taking the projection would make the approach not converge. The authors’ empirical results are not exhaustive enough to establish that projection is not required for most problem instances. Some discussion, and development of a case where projection is necessary can provide further insight for the approaches.

Following the author rebuttal, I am updating my review to acknowledge the theoretical contribtions in this paper that are not present in 5637. The paper is well-written; claims are justified; results are clear. This paper considers the class of zero-sum linear-quadratic games. The authors demonstrate that the stationary points of the control policy space are the Nash equilibrium of the game. The authors propose three nested gradient methods for policy optimisation in such games, with theoretical proofs of convergence and empirical results in small domains. Although not identical the theoretical work is largely the same as that presented in 5637. The only additional contribution is to analyze the algorithmic variants on toy problems. Two of these methods can be approximated in a model-free setting, although no results are presented for this case.

This paper shows that policy iteration converges to the Nash equilibrium in a LQ zero sum game. The authors also propose several nested gradient algorithms that converge to the equilibirum as a rate which is linear in a neighborhood of the equilibria. The paper is technical and very specific to the LQ case. None of the arguments used in the paper can be used in a more general case. The reader is concerned with the claim in lines 183-184. it is assumed that the set \Omega contains the Nash equilibrium. What if this is not the case? can the whole approach still be adapted or everything falls apart? Can one give conditions under which this assumption is satisfied?

This paper studies zero-sum Markov games in the context of linear quadratic systems. Despite the non-convex non-concave nature of the objective, by exploiting the specific dynamics/structure of the linear quadratic systems, the authors propose three nested gradient algorithms with theoretical guarantees. Specifically, they prove global convergence with sub-linear rate and local convergence with a linear rate. They also support their claims with empirical evidence. In the experiments, they compared their proposed algorithms against the simpler variants of them, namely alternating gradient and gradient-descent-ascent (for which convergence analysis is not available in the literature). ************************ + The paper is overall well written, and sufficient background is provided. + They have clearly distinguished their work with other theoretical works on min-max problems. ************************ Regarding Figure 1, in the theoretical analysis, is the performance ordering of the three methods (gradient, natural, and Gauss-Newton) explicitly shown? The main limitation is the (direct) applicability of the proposed algorithms to general RL problems. Does the analysis/insights of this work extend beyond LQ systems? It would be interesting to empirically evaluate the nested gradient type algorithms on some general continuous control problems. In terms of Optimal Control contributions, I am not familiar enough with the literature to assess the novelty of this work.