NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3096 Variance Reduced Policy Evaluation with Smooth Function Approximation

### Reviewer 1

Quality: The paper studies policy evaluation with smooth nonlinear function approximation in the batch setting. Authors consider the mean square projected bellman error as objective and cast it to saddle-point problem using convex conjugate trick to avoid double sampling (which is used in several works). The obtained saddle point problem is strongly concave with respect to the dual variable and nonconvex with respect to the primal variable. Then, they extend SAGA to devise a variance reduced algorithm and provide it convergence rate to a stationary point. Some detailed comments: - in line 116: the definition if the projection \Pi is wrong. \Pi is rather the orthogonal projection on the tangent space that passes by V_{\theta} and translated to the origin. I encourage the authors to go back the original paper of Maei et al. - in equation (9): the objective J(\theta) should be devided by 1/2 because the conjugate of 1/2||x||^2 is 1/2||x||^2. - Authors seem to use interchangeably stationary point and saddle point. while a stationary point is not necessarily saddle point. Could authors explain why it is the case in their setting? - I encourage authors to show why assumption 1 implies the boundedness of the dual variables iterates. it is not obvious !! - Also I am curious about the convergence rate of the algorithm without variance reduction i.e using vanilla stochastic gradient method. Clarity: The paper is pretty readable. Originality: The paper builds on many previous works: minimizing mean square projected bellman error, convex conjugate trick, SAGA for non convex smooth optimization. But the theoretical result is still novel and interesting. Significance: The paper would be stronger if it includes discussion on the algorithm solution in term of value function error and on practical implementation as the algorithm involves computing a Hessian matrix. Besides, empirical results would be very appreciated.

### Reviewer 2

Overall, the paper made significant contribution to both the reinforcement learning community and optimization community. The proposed algorithm is a variant of non-convex SAGA algorithm introduced by [1]. The novelty comes from their proof for the non-convex but strongly concave case. There are several issues which should be addressed: 1, Recasting the policy evaluation as a primal-dual optimization via the Fenchel duality technique is not new. In fact, [2,3,4] have already exploit this reformulation. First, these related work should be referred appropriately. Given these existing work, this contribution in the paper is relatively straightforward. 2, The choice of the \beta plays a vital role to establish the convergence rate. In fact, the finite-step convergence rate highly depends on the choice of \beta. In the proof in Appendix, the beta should satisfy the conditions in Eq. (a0), (a3), and (a4). However, the existence of \beta satisfying such conditions is not explicit checked. Without the existence of such beta, the convergence rate proposed in the paper is not valid. In sum, this is an interesting work to exploit the advanced optimization technique for reinforcement learning problem. However, there are several conditions in the proof should be treated carefully and rigorously. I am happy to raise the score if the authors can address my concerns. 1] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016. [2] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence, pages 504–513, 2015. [3] Dai, Bo, He, Niao, Pan, Yunpeng, Boots, Byron, and Song, Le. Learning from conditional distributions via dual embeddings. arXiv:1607.04579, 2016. [4] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pages 1049–1058, 2017.

### Reviewer 3

This paper studied an off-line TD algorithms via min-max optimization and provided a non-asymptotic analysis of the convergence rate. I do not recommend to accept the paper due to the following reasons. - In my opinion, the paper analyzed only a generic finite-sum min-max optimization problem, not a real analysis for TD algorithm. Here is my reasoning. Although the paper motivates the problem by TD learning, its actual formulation of the problem starts by assuming that a trajectory of state-action sequence is given, and is fixed and deterministic throughout the analysis. Then the problem becomes a typical finite-sum min-max optimization. Throughout the analysis, the randomness of the state and action variables due to their generation via a Markov chain does not play a role, because these variables are treated as fixed (the authors can clarify this in the rebuttal process). Hence, the analysis and results do not reflect any special property of TD algorithms but only min-max minimization. - I only view that paper provides the convergence analysis for the finite-sum min-max problem. For this, the paper proposed variance reduced gradient type of algorithm and characterized the convergence rate. While I do think this makes a contribution, the proof mainly follows from the standard techniques. And such a contribution is not sufficient for publication at NeurIPS. --------------------- After authors' feedback The authors' response did not provide a good answer to my question about their contributions on TD learning. The paper solves only a finite-sum min-max optimization problem, but the claim of contributions (in the title, abstract, Section 1 of introduction, Section 2 of problem setup, etc) was on policy evaluation via nonlinear function approximation in reinforcement learning, which is a very different and much more challenging problem. This is quite misleading for readers to understand the true contribution here. The way that this paper addresses the TD learning problem (see more detail below) does not fit it into the state of the art of TD learning analysis. For the possible interest of the authors, I explain below why I think that the paper does not address the TD learning problem. TD learning has the goal of learning a value function for a policy, and the paper starts from a valid population (in expectation) objective function eq (8) and eq (10) to achieve such a goal. It has been shown (by existing literature) that the solution of such an objective function (eq (8) or eq (10)) does provide desired value function (a good enough approximation to the true value function in the function space). However, the paper does not solve eq (10), but in fact only solves a finite-sum problem eq (16), which is based on a FIXED state-action sequence. The convergence guarantee is established to show the algorithm converges to the saddle point of the finite-sum objective eq (16). Clearly we wonder whether such a solution is desirable for TD learning, i.e., how well such a solution approximates the true value function. There is no such an answer in the paper. One would naturally think that this can be argued by bridging the solution of eq (16) and solution of the original TD objective eq (10), but the paper does not fill this gap. This can be clearly seen from the fact that the analysis of the paper does not even exploit the statistical distribution of state-action and without incorporating this I don't see a way to connect eq (10) and eq (16). Consequently, the paper does not solve the real challenge in the nonlinear TD learning, where we wonder whether we obtain a good enough approximation to the true value function or just a local optimal solution for the chosen objective function.