{"title": "Neural Trust Region/Proximal Policy Optimization Attains Globally Optimal Policy", "book": "Advances in Neural Information Processing Systems", "page_first": 10565, "page_last": 10576, "abstract": "Proximal policy optimization and trust region policy optimization (PPO and TRPO) with actor and critic parametrized by neural networks achieve significant empirical success in deep reinforcement learning. However, due to nonconvexity, the global convergence of PPO and TRPO remains less understood, which separates theory from practice. In this paper, we prove that a variant of PPO and TRPO equipped with overparametrized neural networks converges to the globally optimal policy at a sublinear rate. The key to our analysis is the global convergence of infinite-dimensional mirror descent under a notion of one-point monotonicity, where the gradient and iterate are instantiated by neural networks. In particular, the desirable representation power and optimization geometry induced by the overparametrization of such neural networks allow them to accurately approximate the infinite-dimensional gradient and iterate.", "full_text": "Neural Proximal/Trust Region Policy Optimization\n\nAttains Globally Optimal Policy\n\nBoyi Liu\u21e4\u2020\n\nQi Cai\u21e4\u2021\n\nZhuoran Yang\u00a7\n\nZhaoran Wang\u00b6\n\nAbstract\n\nProximal policy optimization and trust region policy optimization (PPO and\nTRPO) with actor and critic parametrized by neural networks achieve signi\ufb01cant\nempirical success in deep reinforcement learning. However, due to nonconvexity,\nthe global convergence of PPO and TRPO remains less understood, which sepa-\nrates theory from practice. In this paper, we prove that a variant of PPO and TRPO\nequipped with overparametrized neural networks converges to the globally opti-\nmal policy at a sublinear rate. 
The key to our analysis is the global convergence\nof in\ufb01nite-dimensional mirror descent under a notion of one-point monotonicity,\nwhere the gradient and iterate are instantiated by neural networks.\nIn particu-\nlar, the desirable representation power and optimization geometry induced by the\noverparametrization of such neural networks allow them to accurately approxi-\nmate the in\ufb01nite-dimensional gradient and iterate.\n\n1\n\nIntroduction\n\nPolicy optimization aims to \ufb01nd the optimal policy that maximizes the expected total reward through\ngradient-based updates. Coupled with neural networks, proximal policy optimization (PPO) [40]\nand trust region policy optimization (TRPO) [39] are among the most important workhorses behind\nthe empirical success of deep reinforcement learning across applications such as games [34] and\nrobotics [13]. However, the global convergence of policy optimization, including PPO and TRPO,\nremains less understood due to multiple sources of nonconvexity, including (i) the nonconvexity of\nthe expected total reward over the in\ufb01nite-dimensional policy space and (ii) the parametrization of\nboth policy (actor) and action-value function (critic) using neural networks, which leads to noncon-\nvexity in optimizing their parameters. As a result, PPO and TRPO are only guaranteed to monoton-\nically improve the expected total reward over the in\ufb01nite-dimensional policy space [23, 24, 39, 40],\nwhile the global optimality of the attained policy, the rate of convergence, as well as the impact\nof parametrizing policy and action-value function all remain unclear. 
Such a gap between theory\nand practice hinders us from better diagnosing the possible failure of deep reinforcement learning\n[37, 19, 21] and applying it to critical domains such as healthcare [28] and autonomous driving [38]\nin a more principled manner.\nClosing such a theory-practice gap boils down to answering three key questions: (i) In the ideal case\nthat allows for in\ufb01nite-dimensional policy updates based on exact action-value functions, how do\nPPO and TRPO converge to the optimal policy? (ii) When the action-value function is parametrized\nby a neural network, how does temporal-difference learning (TD) [41] converge to an approximate\naction-value function with suf\ufb01cient accuracy within each iteration of PPO and TRPO? (iii) When\n\n\u21e4equal contribution\n\u2020Northwestern University; boyiliu2018@u.northwestern.edu\n\u2021Northwestern University; qicai2022@u.northwestern.edu\n\u00a7Princeton University; zy6@princeton.edu\n\u00b6Northwestern University; zhaoranwang@gmail.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthe policy is parametrized by another neural network, based on the approximate action-value func-\ntion attained by TD, how does stochastic gradient descent (SGD) converge to an improved policy\nthat accurately approximates its ideal version within each iteration of PPO and TRPO? However,\nthese questions largely elude the classical optimization framework, as questions (i)-(iii) involve non-\nconvexity, question (i) involves in\ufb01nite-dimensionality, and question (ii) involves bias in stochastic\n(semi)gradients [44, 42]. Moreover, the policy evaluation error arising from question (ii) compounds\nwith the policy improvement error arising from question (iii), and they together propagate through\nthe iterations of PPO and TRPO, making the convergence analysis even more challenging.\n\nContribution. 
By answering questions (i)-(iii), we establish the first nonasymptotic global rate\nof convergence of a variant of PPO (and TRPO) equipped with neural networks.\nIn detail, we prove that, with policy and action-value function parametrized by randomly initialized and\noverparametrized two-layer neural networks, PPO converges to the optimal policy at the rate of\n$O(1/\\sqrt{K})$, where $K$ is the number of iterations. For solving the subproblems of policy evaluation\nand policy improvement within each iteration of PPO, we establish nonasymptotic upper bounds on\nthe numbers of TD and SGD iterations, respectively. In particular, we prove that, to attain an $\\epsilon$-accuracy\nof policy evaluation and policy improvement, which appears in the constant of the $O(1/\\sqrt{K})$\nrate of PPO, it suffices to take $O(1/\\epsilon^2)$ TD and SGD iterations, respectively.\nMore specifically, to answer question (i), we cast the infinite-dimensional policy updates in the ideal\ncase as mirror descent iterations. To circumvent the lack of convexity, we prove that the expected\ntotal reward satisfies a notion of one-point monotonicity [14], which ensures that the ideal policy\nsequence evolves towards the optimal policy. In particular, we show that, in the context of infinite-dimensional\nmirror descent, the exact action-value function plays the role of dual iterate, while the\nideal policy plays the role of primal iterate [31, 32, 36]. Such a primal-dual perspective allows us to\ncast the policy evaluation error in question (ii) as the dual error and the policy improvement error in\nquestion (iii) as the primal error. 
More speci\ufb01cally, the dual and primal errors arise from using neural\nnetworks to approximate the exact action-value function and the ideal improved policy, respectively.\nTo characterize such errors in questions (ii) and (iii), we unify the convergence analysis of TD for\nminimizing the mean squared Bellman error (MSBE) [7] and SGD for minimizing the mean squared\nerror (MSE) [22, 27, 10, 3, 54, 8, 9, 26, 5], both over neural networks. In particular, we show that\nthe desirable representation power and optimization geometry induced by the overparametrization of\nneural networks enable the global convergence of both the MSBE and MSE, which correspond to the\ndual and primal errors, at a sublinear rate to zero. By incorporating such errors into the analysis of\nin\ufb01nite-dimensional mirror descent, we establish the global rate of convergence of PPO. As a side\nproduct, the proof techniques developed here for handling nonconvexity, in\ufb01nite-dimensionality,\nsemigradient bias, and overparametrization may be of independent interest to the analysis of more\ngeneral deep reinforcement learning algorithms. In addition, it is worth mentioning that, when the\nactivation functions of neural networks are linear, our results cover the classical setting with linear\nfunction approximation, which encompasses the classical tabular setting as a special case.\n\nMore Related Work. PPO [40] and TRPO [39] are proposed to improve the convergence of vanilla\npolicy gradient [49, 43] in deep reinforcement learning. Related algorithms based on the idea of\nKL-regularization include natural policy gradient and actor-critic [23, 35], entropy-regularized pol-\nicy gradient and actor-critic [29], primal-dual actor-critic [12, 11], soft Q-learning and actor-critic\n[17, 18], and dynamic policy programming [6]. Despite its empirical success, policy optimization\ngenerally lacks global convergence guarantees due to nonconvexity. 
One exception is the recent analysis by [33], which establishes the global convergence of TRPO to the optimal policy. However, [33] require infinite-dimensional policy updates based on exact action-value functions and do not provide the nonasymptotic rate of convergence. In contrast, we allow for the parametrization of both policy and action-value function using neural networks and provide the nonasymptotic rate of PPO as well as the iteration complexity of solving the subproblems of policy improvement and policy evaluation. In particular, based on the primal-dual perspective of reinforcement learning [36], we develop a concise convergence proof of PPO as infinite-dimensional mirror descent under one-point monotonicity, which is of independent interest. In addition, we refer to the closely related concurrent work [2] for the global convergence analysis of (natural) policy gradient for discrete state and action spaces as well as continuous state space with linear function approximation. See also the concurrent work [52], which studies continuous state space with general function approximation, but only establishes the convergence to a locally optimal policy. In addition, in our companion paper [48], we establish the global convergence of neural (natural) policy gradient.\n\n2 Background\n\nIn this section, we briefly introduce the general setting of reinforcement learning as well as PPO and TRPO.\n\nMarkov Decision Process. We consider the Markov decision process $(\\mathcal{S}, \\mathcal{A}, \\mathcal{P}, r, \\gamma)$, where $\\mathcal{S}$ is a compact state space, $\\mathcal{A}$ is a finite action space, $\\mathcal{P} : \\mathcal{S} \\times \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$ is the transition kernel, $r : \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$ is the reward function, and $\\gamma \\in (0, 1)$ is the discount factor. We track the performance of a policy $\\pi : \\mathcal{A} \\times \\mathcal{S} \\to \\mathbb{R}$ using its action-value function (Q-function) $Q^\\pi : \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$, which is defined as\n\n$$Q^\\pi(s, a) = (1 - \\gamma) \\cdot \\mathbb{E}\\Big[\\sum_{t=0}^{\\infty} \\gamma^t \\cdot r(s_t, a_t) \\,\\Big|\\, s_0 = s, a_0 = a, a_t \\sim \\pi(\\cdot | s_t), s_{t+1} \\sim \\mathcal{P}(\\cdot | s_t, a_t)\\Big].$$\n\nCorrespondingly, the state-value function $V^\\pi : \\mathcal{S} \\to \\mathbb{R}$ of a policy $\\pi$ is defined as\n\n
$$V^\\pi(s) = (1 - \\gamma) \\cdot \\mathbb{E}\\Big[\\sum_{t=0}^{\\infty} \\gamma^t \\cdot r(s_t, a_t) \\,\\Big|\\, s_0 = s, a_t \\sim \\pi(\\cdot | s_t), s_{t+1} \\sim \\mathcal{P}(\\cdot | s_t, a_t)\\Big].$$ (2.1)\n\nThe advantage function $A^\\pi : \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$ of a policy $\\pi$ is defined as $A^\\pi(s, a) = Q^\\pi(s, a) - V^\\pi(s)$. We denote by $\\nu_\\pi(s)$ and $\\sigma_\\pi(s, a) = \\pi(a | s) \\cdot \\nu_\\pi(s)$ the stationary state distribution and the stationary state-action distribution associated with a policy $\\pi$, respectively. Correspondingly, we denote by $\\mathbb{E}_{\\sigma_\\pi}[\\cdot]$ and $\\mathbb{E}_{\\nu_\\pi}[\\cdot]$ the expectations $\\mathbb{E}_{(s,a) \\sim \\sigma_\\pi}[\\cdot] = \\mathbb{E}_{a \\sim \\pi(\\cdot | s), s \\sim \\nu_\\pi(\\cdot)}[\\cdot]$ and $\\mathbb{E}_{s \\sim \\nu_\\pi}[\\cdot]$, respectively. Meanwhile, we denote by $\\langle \\cdot, \\cdot \\rangle$ the inner product over $\\mathcal{A}$, e.g., we have $V^\\pi(s) = \\mathbb{E}_{a \\sim \\pi(\\cdot | s)}[Q^\\pi(s, a)] = \\langle Q^\\pi(s, \\cdot), \\pi(\\cdot | s) \\rangle$.\nPPO and TRPO. At the k-th iteration of PPO, the policy parameter $\\theta$ is updated by\n\n$$\\theta_{k+1} \\leftarrow \\mathop{\\mathrm{argmax}}_{\\theta} \\hat{\\mathbb{E}}\\Big[\\frac{\\pi_\\theta(a | s)}{\\pi_{\\theta_k}(a | s)} \\cdot A_k(s, a) - \\beta_k \\cdot \\mathrm{KL}(\\pi_\\theta(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s))\\Big],$$ (2.2)\n\nwhere $A_k$ is an estimator of $A^{\\pi_{\\theta_k}}$ and $\\hat{\\mathbb{E}}[\\cdot]$ is taken with respect to the empirical version of $\\sigma_{\\pi_{\\theta_k}}$, that is, the empirical stationary state-action distribution associated with the current policy $\\pi_{\\theta_k}$. In practice, the penalty parameter $\\beta_k$ is adjusted by line search.\nAt the k-th iteration of TRPO, the policy parameter $\\theta$ is updated by\n\n$$\\theta_{k+1} \\leftarrow \\mathop{\\mathrm{argmax}}_{\\theta} \\hat{\\mathbb{E}}\\Big[\\frac{\\pi_\\theta(a | s)}{\\pi_{\\theta_k}(a | s)} \\cdot A_k(s, a)\\Big], \\quad \\text{subject to } \\mathrm{KL}(\\pi_\\theta(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s)) \\le \\delta,$$ (2.3)\n\nwhere $\\delta$ is the radius of the trust region. 
The PPO update in (2.2) can be viewed as a Lagrangian relaxation of the TRPO update in (2.3) with Lagrangian multiplier $\\beta_k$, which implies their updates are equivalent if $\\beta_k$ is properly chosen. Without loss of generality, we focus on PPO hereafter.\nIt is worth mentioning that, compared with the original versions of PPO [40] and TRPO [39], the variants in (2.2) and (2.3) use $\\mathrm{KL}(\\pi_\\theta(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s))$ instead of $\\mathrm{KL}(\\pi_{\\theta_k}(\\cdot | s) \\| \\pi_\\theta(\\cdot | s))$. In Sections 3 and 4, we show that, like the original versions, such variants also allow us to approximately obtain the improved policy $\\pi_{\\theta_{k+1}}$ using SGD, and moreover, enjoy global convergence.\n\n3 Neural PPO\n\nWe present more details of PPO with policy and action-value function parametrized by neural networks. For notational simplicity, we denote by $\\nu_k$ and $\\sigma_k$ the stationary state distribution $\\nu_{\\pi_{\\theta_k}}$ and the stationary state-action distribution $\\sigma_{\\pi_{\\theta_k}}$, respectively. Also, we define an auxiliary distribution $\\tilde\\sigma_k$ over $\\mathcal{S} \\times \\mathcal{A}$ as $\\tilde\\sigma_k = \\nu_k \\pi_0$.\nNeural Network Parametrization. Without loss of generality, we assume that $(s, a) \\in \\mathbb{R}^d$ for all $s \\in \\mathcal{S}$ and $a \\in \\mathcal{A}$. We parametrize a function $u : \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$, e.g., policy $\\pi$ or action-value function $Q^\\pi$, by the following two-layer neural network, which is denoted by $\\mathrm{NN}(\\alpha; m)$,\n\n$$u_\\alpha(s, a) = \\frac{1}{\\sqrt{m}} \\sum_{i=1}^{m} b_i \\cdot \\sigma([\\alpha]_i^\\top (s, a)).$$ (3.1)\n\nHere $m$ is the width of the neural network, $b_i \\in \\{-1, 1\\}$ ($i \\in [m]$) are the output weights, $\\sigma(\\cdot)$ is the rectified linear unit (ReLU) activation, and $\\alpha = ([\\alpha]_1^\\top, \\ldots, [\\alpha]_m^\\top)^\\top \\in \\mathbb{R}^{md}$ with $[\\alpha]_i \\in \\mathbb{R}^d$ ($i \\in [m]$) are the input weights. 
We consider the random initialization\n\n$$b_i \\overset{\\text{i.i.d.}}{\\sim} \\mathrm{Unif}(\\{-1, 1\\}), \\quad [\\alpha(0)]_i \\overset{\\text{i.i.d.}}{\\sim} N(0, I_d/d), \\quad \\text{for all } i \\in [m].$$ (3.2)\n\nWe restrict the input weights $\\alpha$ to an $\\ell_2$-ball centered at the initialization $\\alpha(0)$ by the projection $\\Pi_{\\mathcal{B}^0(R_\\alpha)}(\\alpha') = \\mathop{\\mathrm{argmin}}_{\\alpha \\in \\mathcal{B}^0(R_\\alpha)} \\{\\|\\alpha - \\alpha'\\|_2\\}$, where $\\mathcal{B}^0(R_\\alpha) = \\{\\alpha : \\|\\alpha - \\alpha(0)\\|_2 \\le R_\\alpha\\}$. Throughout training, we only update $\\alpha$, while keeping $b_i$ ($i \\in [m]$) fixed at the initialization. Hence, we omit the dependency on $b_i$ ($i \\in [m]$) in $\\mathrm{NN}(\\alpha; m)$ and $u_\\alpha(s, a)$.\nPolicy Improvement. We consider the population version of the objective function in (2.2),\n\n$$L(\\theta) = \\mathbb{E}_{\\nu_k}\\big[\\langle Q_{\\omega_k}(s, \\cdot), \\pi_\\theta(\\cdot | s) \\rangle - \\beta_k \\cdot \\mathrm{KL}(\\pi_\\theta(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s))\\big],$$ (3.3)\n\nwhere $Q_{\\omega_k}$ is an estimator of $Q^{\\pi_{\\theta_k}}$, the exact action-value function of $\\pi_{\\theta_k}$. In the following, we convert the subproblem $\\max_\\theta L(\\theta)$ of policy improvement into a least-squares subproblem. We consider the energy-based policy $\\pi(a | s) \\propto \\exp\\{\\tau^{-1} f(s, a)\\}$, which is abbreviated as $\\pi \\propto \\exp\\{\\tau^{-1} f\\}$. Here $f : \\mathcal{S} \\times \\mathcal{A} \\to \\mathbb{R}$ is the energy function and $\\tau > 0$ is the temperature parameter. We have the following closed form of the ideal infinite-dimensional policy update. See also, e.g., [1] for a Bayesian inference perspective.\nProposition 3.1. Let $\\pi_{\\theta_k} \\propto \\exp\\{\\tau_k^{-1} f_{\\theta_k}\\}$ be an energy-based policy. Given an estimator $Q_{\\omega_k}$ of $Q^{\\pi_{\\theta_k}}$, the update $\\hat\\pi_{k+1} \\leftarrow \\mathop{\\mathrm{argmax}}_\\pi \\{\\mathbb{E}_{\\nu_k}[\\langle Q_{\\omega_k}(s, \\cdot), \\pi(\\cdot | s) \\rangle - \\beta_k \\cdot \\mathrm{KL}(\\pi(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s))]\\}$ gives\n\n$$\\hat\\pi_{k+1} \\propto \\exp\\{\\beta_k^{-1} Q_{\\omega_k} + \\tau_k^{-1} f_{\\theta_k}\\}.$$ (3.4)\n\nProof. See Appendix C for a detailed proof.\n\nHere we note that the closed form of the ideal infinite-dimensional update in (3.4) holds state-wise. 
To represent the ideal improved policy $\\hat\\pi_{k+1}$ in Proposition 3.1 using the energy-based policy $\\pi_{\\theta_{k+1}} \\propto \\exp\\{\\tau_{k+1}^{-1} f_{\\theta_{k+1}}\\}$, we solve the subproblem of minimizing the MSE,\n\n$$\\theta_{k+1} \\leftarrow \\mathop{\\mathrm{argmin}}_{\\theta \\in \\mathcal{B}^0(R_f)} \\mathbb{E}_{\\tilde\\sigma_k}\\big[\\big(f_\\theta(s, a) - \\tau_{k+1} \\cdot (\\beta_k^{-1} Q_{\\omega_k}(s, a) + \\tau_k^{-1} f_{\\theta_k}(s, a))\\big)^2\\big],$$ (3.5)\n\nwhich is justified in Appendix B as a majorization of $-L(\\theta)$ defined in (3.3). Here we use the neural network parametrization $f_\\theta = \\mathrm{NN}(\\theta; m_f)$ defined in (3.1), where $\\theta$ denotes the input weights and $m_f$ is the width. It is worth mentioning that in (3.5) we sample the actions according to $\\tilde\\sigma_k$ so that $\\pi_{\\theta_{k+1}}$ approximates the ideal infinite-dimensional policy update in (3.4) evenly well over all actions. Also note that the subproblem in (3.5) allows for off-policy sampling of both states and actions [1]. To solve (3.5), we use the SGD update\n\n$$\\theta(t + 1/2) \\leftarrow \\theta(t) - \\eta \\cdot \\big(f_{\\theta(t)}(s, a) - \\tau_{k+1} \\cdot (\\beta_k^{-1} Q_{\\omega_k}(s, a) + \\tau_k^{-1} f_{\\theta_k}(s, a))\\big) \\cdot \\nabla_\\theta f_{\\theta(t)}(s, a),$$ (3.6)\n\nwhere $(s, a) \\sim \\tilde\\sigma_k$ and $\\theta(t + 1) \\leftarrow \\Pi_{\\mathcal{B}^0(R_f)}(\\theta(t + 1/2))$. Here $\\eta$ is the stepsize. See Appendix A for a detailed algorithm.\nPolicy Evaluation. To obtain the estimator $Q_{\\omega_k}$ of $Q^{\\pi_{\\theta_k}}$ in (3.3), we solve the subproblem of minimizing the MSBE,\n\n$$\\omega_k \\leftarrow \\mathop{\\mathrm{argmin}}_{\\omega \\in \\mathcal{B}^0(R_Q)} \\mathbb{E}_{\\sigma_k}\\big[(Q_\\omega(s, a) - [\\mathcal{T}^{\\pi_{\\theta_k}} Q_\\omega](s, a))^2\\big].$$ (3.7)\n\nHere the Bellman evaluation operator $\\mathcal{T}^\\pi$ of a policy $\\pi$ is defined as\n\n$$[\\mathcal{T}^\\pi Q](s, a) = \\mathbb{E}\\big[(1 - \\gamma) \\cdot r(s, a) + \\gamma \\cdot Q(s', a') \\,\\big|\\, s' \\sim \\mathcal{P}(\\cdot | s, a), a' \\sim \\pi(\\cdot | s')\\big].$$\n\nWe use the neural network parametrization $Q_\\omega = \\mathrm{NN}(\\omega; m_Q)$ defined in (3.1), where $\\omega$ denotes the input weights and $m_Q$ is the width. 
To solve (3.7), we use the TD update\n\n$$\\omega(t + 1/2) \\leftarrow \\omega(t) - \\eta \\cdot \\big(Q_{\\omega(t)}(s, a) - (1 - \\gamma) \\cdot r(s, a) - \\gamma \\cdot Q_{\\omega(t)}(s', a')\\big) \\cdot \\nabla_\\omega Q_{\\omega(t)}(s, a),$$ (3.8)\n\nwhere $(s, a) \\sim \\sigma_k$, $s' \\sim \\mathcal{P}(\\cdot | s, a)$, $a' \\sim \\pi_{\\theta_k}(\\cdot | s')$, and $\\omega(t + 1) = \\Pi_{\\mathcal{B}^0(R_Q)}(\\omega(t + 1/2))$. Here $\\eta$ is the stepsize. See Appendix A for a detailed algorithm.\n\nNeural PPO. By assembling the subproblems of policy improvement and policy evaluation, we present neural PPO in Algorithm 1, which is characterized in Section 4.\n\nAlgorithm 1 Neural PPO\nRequire: MDP $(\\mathcal{S}, \\mathcal{A}, \\mathcal{P}, r, \\gamma)$, penalty parameter $\\beta$, widths $m_f$ and $m_Q$, number of SGD and TD iterations $T$, number of TRPO iterations $K$, and projection radii $R_f \\ge R_Q$\n1: Initialize with uniform policy: $\\tau_0 \\leftarrow 1$, $f_{\\theta_0} \\leftarrow 0$, $\\pi_{\\theta_0} \\leftarrow \\pi_0 \\propto \\exp\\{\\tau_0^{-1} f_{\\theta_0}\\}$\n2: for $k = 0, \\ldots, K - 1$ do\n3: Set temperature parameter $\\tau_{k+1} \\leftarrow \\beta \\sqrt{K}/(k + 1)$ and penalty parameter $\\beta_k \\leftarrow \\beta \\sqrt{K}$\n4: Sample $\\{(s_t, a_t, a_t^0, s_t', a_t')\\}_{t=1}^{T}$ with $(s_t, a_t) \\sim \\sigma_k$, $a_t^0 \\sim \\pi_0(\\cdot | s_t)$, $s_t' \\sim \\mathcal{P}(\\cdot | s_t, a_t)$ and $a_t' \\sim \\pi_{\\theta_k}(\\cdot | s_t')$\n5: Solve for $Q_{\\omega_k} = \\mathrm{NN}(\\omega_k; m_Q)$ in (3.7) using the TD update in (3.8) (Algorithm 3)\n6: Solve for $f_{\\theta_{k+1}} = \\mathrm{NN}(\\theta_{k+1}; m_f)$ in (3.5) using the SGD update in (3.6) (Algorithm 2)\n7: Update policy: $\\pi_{\\theta_{k+1}} \\propto \\exp\\{\\tau_{k+1}^{-1} f_{\\theta_{k+1}}\\}$\n8: end for\n\n4 Main Results\n\nIn this section, we establish the global convergence of neural PPO in Algorithm 1 based on characterizing the errors arising from solving the subproblems of policy improvement and policy evaluation in (3.5) and (3.7), respectively.\nOur analysis relies on the following regularity condition on the boundedness of reward.\nAssumption 4.1 (Bounded Reward). 
There exists a constant $R_{\\max} > 0$ such that $R_{\\max} = \\sup_{(s,a) \\in \\mathcal{S} \\times \\mathcal{A}} |r(s, a)|$, which implies $|V^\\pi(s)| \\le R_{\\max}$ and $|Q^\\pi(s, a)| \\le R_{\\max}$ for any policy $\\pi$.\n\nTo ensure the compatibility between the policy and the action-value function [25, 43, 23, 35, 46, 47], we set $m_f = m_Q$ and use the following random initialization. In Algorithm 1, we first generate according to (3.2) the random initialization $\\alpha(0) = \\theta(0) = \\omega(0)$ and $b_i$ ($i \\in [m]$), and then use it as the fixed initialization of both SGD and TD in Lines 6 and 5 of Algorithm 1 for all $k \\in [K]$, respectively.\n\n4.1 Errors of Policy Improvement and Policy Evaluation\nWe define the following function class, which characterizes the representation power of the neural network defined in (3.1).\nDefinition 4.2. For any constant $R > 0$, we define the function class\n\n$$\\mathcal{F}_{R,m} = \\Big\\{ \\frac{1}{\\sqrt{m}} \\sum_{i=1}^{m} b_i \\cdot \\mathbf{1}\\big\\{[\\alpha(0)]_i^\\top (s, a) > 0\\big\\} \\cdot [\\alpha]_i^\\top (s, a) : \\|\\alpha - \\alpha(0)\\|_2 \\le R \\Big\\},$$\n\nwhere $[\\alpha(0)]_i$ and $b_i$ ($i \\in [m]$) are the random initialization defined in (3.2).\nAs $m \\to \\infty$, $\\mathcal{F}_{R,m} - \\mathrm{NN}(\\alpha(0); m)$ approximates a subset of the reproducing kernel Hilbert space (RKHS) induced by the kernel $K(x, y) = \\mathbb{E}_{z \\sim N(0, I_d/d)}[\\mathbf{1}\\{z^\\top x > 0, z^\\top y > 0\\} x^\\top y]$ [22, 27, 10, 3, 54, 8, 9, 26, 5, 7]. Such a subset is a ball with radius $R$ in the corresponding $\\mathcal{H}$-norm, which is known to be a rich function class [20]. Correspondingly, for a sufficiently large width $m$ and radius $R$, $\\mathcal{F}_{R,m}$ is also a sufficiently rich function class.\nBased on Definition 4.2, we lay out the following regularity condition on the action-value function class.\nAssumption 4.3 (Action-Value Function Class). It holds that $Q^\\pi(s, a) \\in \\mathcal{F}_{R_Q, m_Q}$ for any $\\pi$.\nAssumption 4.3 states that $\\mathcal{F}_{R_Q, m_Q}$ is closed under the Bellman evaluation operator $\\mathcal{T}^\\pi$, as $Q^\\pi$ is the fixed-point solution of the Bellman equation $\\mathcal{T}^\\pi Q^\\pi = Q^\\pi$. 
Such a regularity condition is commonly used in the literature [30, 4, 16, 15, 45, 51]. In particular, [50] define a class of Markov decision processes that satisfy such a regularity condition, which is sufficiently rich due to the representation power of $\\mathcal{F}_{R_Q, m_Q}$.\n\nIn the sequel, we lay out another regularity condition on the stationary state-action distribution $\\sigma_\\pi$.\nAssumption 4.4 (Regularity of Stationary Distribution). There exists a constant $c > 0$ such that for any vector $z \\in \\mathbb{R}^d$ and $\\zeta > 0$, it holds almost surely that $\\mathbb{E}_{\\sigma_\\pi}[\\mathbf{1}\\{|z^\\top (s, a)| \\le \\zeta\\} \\,|\\, z] \\le c \\cdot \\zeta / \\|z\\|_2$ for any $\\pi$.\n\nAssumption 4.4 states that the density of $\\sigma_\\pi$ is sufficiently regular. Such a regularity condition holds as long as the stationary state distribution $\\nu_\\pi$ has upper bounded density.\nWe are now ready to present bounds for the errors induced by approximation via two-layer neural networks, with the analysis, which generalizes those of [7, 5], included in Appendix D. First, we characterize the policy improvement error, which is induced by solving the subproblem in (3.5) using the SGD update in (3.6), in the following theorem. See Line 6 of Algorithm 1 and Algorithm 2 for a detailed algorithm.\nTheorem 4.5 (Policy Improvement Error). Suppose that Assumptions 4.1, 4.3, and 4.4 hold. We set $T \\ge 64$ and the stepsize to be $\\eta = T^{-1/2}$. Within the k-th iteration of Algorithm 1, the output $f_\\theta$ of Algorithm 2 satisfies\n\n$$\\mathbb{E}_{\\mathrm{init}, \\tilde\\sigma_k}\\big[\\big(f_\\theta(s, a) - \\tau_{k+1} \\cdot (\\beta_k^{-1} Q_{\\omega_k}(s, a) + \\tau_k^{-1} f_{\\theta_k}(s, a))\\big)^2\\big] = O(R_f^2 T^{-1/2} + R_f^{5/2} m_f^{-1/4} + R_f^3 m_f^{-1/2}).$$\n\nProof. See Appendix D for a detailed proof.\n\nSimilarly, we characterize the policy evaluation error, which is induced by solving the subproblem in (3.7) using the TD update in (3.8), in the following theorem. See Line 5 of Algorithm 1 and Algorithm 3 for a detailed algorithm.\nTheorem 4.6 (Policy Evaluation Error). 
Suppose that Assumptions 4.1, 4.3, and 4.4 hold. We set $T \\ge 64/(1 - \\gamma)^2$ and the stepsize to be $\\eta = T^{-1/2}$. Within the k-th iteration of Algorithm 1, the output $Q_\\omega$ of Algorithm 3 satisfies\n\n$$\\mathbb{E}_{\\mathrm{init}, \\sigma_k}\\big[(Q_\\omega(s, a) - Q^{\\pi_{\\theta_k}}(s, a))^2\\big] = O(R_Q^2 T^{-1/2} + R_Q^{5/2} m_Q^{-1/4} + R_Q^3 m_Q^{-1/2}).$$\n\nProof. See Appendix D for a detailed proof.\n\nAs we show in Sections 4.3 and 5, Theorems 4.5 and 4.6 characterize the primal and dual errors of the infinite-dimensional mirror descent corresponding to neural PPO. In particular, such errors decay to zero at the rate of $1/\\sqrt{T}$ when the width $m_f = m_Q$ is sufficiently large, where $T$ is the number of TD and SGD iterations in Algorithm 1. For notational simplicity, we omit the dependency on the random initialization in the expectations hereafter.\n\n4.2 Error Propagation\nWe denote by $\\pi^*$ the optimal policy with $\\nu^*$ being its stationary state distribution and $\\sigma^*$ being its stationary state-action distribution. Recall that, as defined in (3.4), $\\hat\\pi_{k+1}$ is the ideal improved policy based on $Q_{\\omega_k}$, which is an estimator of the exact action-value function $Q^{\\pi_{\\theta_k}}$. Correspondingly, we define the ideal improved policy based on $Q^{\\pi_{\\theta_k}}$ as\n\n$$\\pi_{k+1} = \\mathop{\\mathrm{argmax}}_{\\pi} \\mathbb{E}_{\\nu_k}\\big[\\langle Q^{\\pi_{\\theta_k}}(s, \\cdot), \\pi(\\cdot | s) \\rangle - \\beta_k \\cdot \\mathrm{KL}(\\pi(\\cdot | s) \\| \\pi_{\\theta_k}(\\cdot | s))\\big].$$ (4.1)\n\nBy the same proof as that of Proposition 3.1, we have $\\pi_{k+1} \\propto \\exp\\{\\beta_k^{-1} Q^{\\pi_{\\theta_k}} + \\tau_k^{-1} f_{\\theta_k}\\}$, which is also an energy-based policy.\nWe define the following quantities related to density ratios between policies or stationary distributions,\n\n$$\\phi_k^* = \\mathbb{E}_{\\tilde\\sigma_k}\\big[|d\\sigma^*/d\\tilde\\sigma_k - d(\\pi_{\\theta_k} \\nu^*)/d\\tilde\\sigma_k|^2\\big]^{1/2}, \\quad \\psi_k^* = \\mathbb{E}_{\\sigma_k}\\big[|d\\sigma^*/d\\sigma_k - d\\nu^*/d\\nu_k|^2\\big]^{1/2},$$ (4.2)\n\nwhere $d\\sigma^*/d\\tilde\\sigma_k$, $d(\\pi_{\\theta_k} \\nu^*)/d\\tilde\\sigma_k$, $d\\sigma^*/d\\sigma_k$, and $d\\nu^*/d\\nu_k$ are the Radon-Nikodym derivatives. 
A closely related quantity known as the concentrability coefficient is commonly used in the literature [30, 4, 16, 45, 51]. In comparison, as our analysis is based on stationary distributions, our definitions of $\\phi_k^*$ and $\\psi_k^*$ are simpler in that they do not require unrolling the state-action sequence. Then we have the following lemma that quantifies how the errors of policy improvement and policy evaluation propagate into the infinite-dimensional policy space.\n\nLemma 4.7 (Error Propagation). Suppose that the policy improvement error in Line 6 of Algorithm 1 satisfies\n\n$$\\mathbb{E}_{\\tilde\\sigma_k}\\big[\\big(f_{\\theta_{k+1}}(s, a) - \\tau_{k+1} \\cdot (\\beta_k^{-1} Q_{\\omega_k}(s, a) + \\tau_k^{-1} f_{\\theta_k}(s, a))\\big)^2\\big] \\le \\epsilon_{k+1},$$ (4.3)\n\nand the policy evaluation error in Line 5 of Algorithm 1 satisfies\n\n$$\\mathbb{E}_{\\sigma_k}\\big[(Q_{\\omega_k}(s, a) - Q^{\\pi_{\\theta_k}}(s, a))^2\\big] \\le \\epsilon_k'.$$ (4.4)\n\nFor $\\pi_{k+1}$ defined in (4.1) and $\\pi_{\\theta_{k+1}}$ obtained in Line 7 of Algorithm 1, we have\n\n$$\\mathbb{E}_{\\nu^*}\\big[\\big\\langle \\log(\\pi_{\\theta_{k+1}}(\\cdot | s)/\\pi_{k+1}(\\cdot | s)), \\pi^*(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s) \\big\\rangle\\big] \\le \\varepsilon_k,$$ (4.5)\n\nwhere $\\varepsilon_k = \\tau_{k+1}^{-1} \\epsilon_{k+1} \\cdot \\phi_{k+1}^* + \\beta_k^{-1} \\epsilon_k' \\cdot \\psi_k^*$.\nProof. See Appendix E for a detailed proof.\n\nLemma 4.7 quantifies the difference between the ideal case, where we use the infinite-dimensional policy update based on the exact action-value function, and the realistic case, where we use the neural networks defined in (3.1) to approximate the exact action-value function and the ideal improved policy.\nThe following lemma characterizes the difference between $f_{\\theta_{k+1}}$ and $f_{\\theta_k}$.\nLemma 4.8 (Stepwise Energy Difference). 
Under the same conditions as Lemma 4.7, we have\n\n$$\\mathbb{E}_{\\nu^*}\\big[\\|\\tau_{k+1}^{-1} f_{\\theta_{k+1}}(s, \\cdot) - \\tau_k^{-1} f_{\\theta_k}(s, \\cdot)\\|_\\infty^2\\big] \\le 2\\varepsilon_k' + 2\\beta_k^{-2} M,$$\n\nwhere $\\varepsilon_k' = |\\mathcal{A}| \\cdot \\tau_{k+1}^{-2} \\epsilon_{k+1}^2$ and $M = 2\\mathbb{E}_{\\nu^*}[\\max_{a \\in \\mathcal{A}} (Q_{\\omega_0}(s, a))^2] + 2R_f^2$.\nProof. See Appendix E for a detailed proof.\n\nIntuitively, the bounded difference between $f_{\\theta_{k+1}}$ and $f_{\\theta_k}$ quantified in Lemma 4.8 is due to the KL-regularization in (3.3), which keeps the updated policy $\\pi_{\\theta_{k+1}}$ from being too far away from the current policy $\\pi_{\\theta_k}$.\nThe differences characterized in Lemmas 4.7 and 4.8 play key roles in establishing the global convergence of neural PPO.\n\n4.3 Global Convergence of Neural PPO\n\nWe track the progress of neural PPO in Algorithm 1 using the expected total reward\n\n$$L(\\pi) = \\mathbb{E}_{\\nu^*}[V^\\pi(s)] = \\mathbb{E}_{\\nu^*}[\\langle Q^\\pi(s, \\cdot), \\pi(\\cdot | s) \\rangle],$$ (4.6)\n\nwhere $\\nu^*$ is the stationary state distribution of the optimal policy $\\pi^*$. The following theorem characterizes the global convergence of $L(\\pi_{\\theta_k})$ towards $L(\\pi^*)$. Recall that $T_f$ and $T_Q$ are the numbers of SGD and TD iterations in Lines 6 and 5 of Algorithm 1, while $\\phi_k^*$ and $\\psi_k^*$ are defined in (4.2).\nTheorem 4.9 (Global Rate of Convergence of Neural PPO). Suppose that Assumptions 4.1, 4.3, and 4.4 hold. For the policy sequence $\\{\\pi_{\\theta_k}\\}_{k=1}^{K}$ attained by neural PPO in Algorithm 1, we have\n\n$$\\min_{0 \\le k \\le K} \\{L(\\pi^*) - L(\\pi_{\\theta_k})\\} \\le \\frac{\\beta^2 \\log |\\mathcal{A}| + M + \\beta^2 \\sum_{k=0}^{K-1} (\\varepsilon_k + \\varepsilon_k')}{(1 - \\gamma) \\beta \\cdot \\sqrt{K}}.$$\n\nHere $\\varepsilon_k = \\tau_{k+1}^{-1} \\epsilon_{k+1} \\cdot \\phi_k^* + \\beta_k^{-1} \\epsilon_k' \\cdot \\psi_k^*$ and $\\varepsilon_k' = |\\mathcal{A}| \\cdot \\tau_{k+1}^{-2} \\epsilon_{k+1}^2$, where\n\n$$\\epsilon_{k+1} = O(R_f^2 T^{-1/2} + R_f^{5/2} m_f^{-1/4} + R_f^3 m_f^{-1/2}), \\quad \\epsilon_k' = O(R_Q^2 T^{-1/2} + R_Q^{5/2} m_Q^{-1/4} + R_Q^3 m_Q^{-1/2}).$$\n\nAlso, we have $M = 2\\mathbb{E}_{\\nu^*}[\\max_{a \\in \\mathcal{A}} (Q_{\\omega_0}(s, a))^2] + 2R_f^2$.\nProof. See Section 5 for a detailed proof of Theorem 4.9. 
The key to our proof is the global convergence of infinite-dimensional mirror descent with errors under one-point monotonicity, where the primal and dual errors are characterized by Theorems 4.5 and 4.6, respectively.\n\nTo understand Theorem 4.9, we consider the infinite-dimensional policy update based on the exact action-value function, that is, $\\epsilon_{k+1} = \\epsilon_k' = 0$ for any $k + 1 \\in [K]$. In such an ideal case, by Theorem 4.9, neural PPO globally converges to the optimal policy $\\pi^*$ at the rate of\n\n$$\\min_{0 \\le k \\le K} \\{L(\\pi^*) - L(\\pi_{\\theta_k})\\} \\le \\frac{2\\sqrt{M \\log |\\mathcal{A}|}}{(1 - \\gamma) \\cdot \\sqrt{K}},$$\n\nwith the optimal choice of the penalty parameter $\\beta_k = \\sqrt{MK/\\log |\\mathcal{A}|}$.\nNote that Theorem 4.9 sheds light on the difficulty of choosing the optimal penalty coefficient in practice, which is observed by [40]. In particular, the optimal choice of $\\beta$ in $\\beta_k = \\beta \\sqrt{K}$ is given by\n\n$$\\beta = \\frac{\\sqrt{M}}{\\sqrt{\\log |\\mathcal{A}| + \\sum_{k=0}^{K-1} (\\varepsilon_k + \\varepsilon_k')}},$$\n\nwhere $M$ and $\\sum_{k=0}^{K-1} (\\varepsilon_k + \\varepsilon_k')$ may vary across different deep reinforcement learning problems. As a result, line search is often needed in practice.\nTo better understand Theorem 4.9, the following corollary quantifies the minimum widths $m_f$ and $m_Q$ and the minimum number of SGD and TD iterations $T$ that ensure the $O(1/\\sqrt{K})$ rate of convergence.\nCorollary 4.10 (Iteration Complexity of Subproblems and Minimum Widths of Neural Networks). Suppose that Assumptions 4.1, 4.3, and 4.4 hold. 
Let $m_f = \\Omega(K^6 R_f^{10} \\cdot {\\phi_k^*}^4 + K^4 R_f^{10} \\cdot |\\mathcal{A}|^2)$, $m_Q = \\Omega(K^2 R_Q^{10} \\cdot {\\psi_k^*}^4)$, and $T = \\Omega(K^3 R_f^4 \\cdot {\\phi_k^*}^2 + K^2 R_f^4 \\cdot |\\mathcal{A}| + K R_Q^4 \\cdot {\\psi_k^*}^2)$ for any $0 \\le k \\le K$. We have\n\n$$\\min_{0 \\le k \\le K} \\{L(\\pi^*) - L(\\pi_{\\theta_k})\\} \\le \\frac{\\beta^2 \\log |\\mathcal{A}| + M + O(1)}{(1 - \\gamma) \\beta \\cdot \\sqrt{K}}.$$\n\nProof. See Appendix F for a detailed proof.\n\nThe difference between the requirements on the widths $m_f$ and $m_Q$ in Corollary 4.10 suggests that the errors of policy improvement and policy evaluation play distinct roles in the global convergence of neural PPO. In fact, Theorem 4.9 depends on the total error $\\tau_{k+1}^{-1} \\epsilon_{k+1} \\cdot \\phi_k^* + \\beta_k^{-1} \\epsilon_k' \\cdot \\psi_k^* + |\\mathcal{A}| \\cdot \\tau_{k+1}^{-2} \\epsilon_{k+1}^2$, where the weight $\\tau_{k+1}^{-1}$ of the policy improvement error $\\epsilon_{k+1}$ is much larger than the weight $\\beta_k^{-1}$ of the policy evaluation error $\\epsilon_k'$, and $|\\mathcal{A}| \\cdot \\tau_{k+1}^{-2} \\epsilon_{k+1}^2$ is a high-order term when $\\epsilon_{k+1}$ is sufficiently small. In other words, the policy improvement error plays a more important role.\n\n5 Proof Sketch\n\nIn this section, we sketch the proof of Theorem 4.9. In detail, we cast neural PPO in Algorithm 1 as infinite-dimensional mirror descent with primal and dual errors and exploit a notion of one-point monotonicity to establish its global convergence.\nWe first present the performance difference lemma of [24]. Recall that the expected total reward $L(\\pi)$ is defined in (4.6) and $\\nu^*$ is the stationary state distribution of the optimal policy $\\pi^*$.\nLemma 5.1 (Performance Difference). For $L(\\pi)$ defined in (4.6), we have\n\n$$L(\\pi) - L(\\pi^*) = (1 - \\gamma)^{-1} \\cdot \\mathbb{E}_{\\nu^*}[\\langle Q^\\pi(s, \\cdot), \\pi(\\cdot | s) - \\pi^*(\\cdot | s) \\rangle].$$\n\nProof. 
See Appendix G for a detailed proof.\n\nSince the optimal policy $\\pi^*$ maximizes the value function $V^\\pi(s)$ with respect to $\\pi$ for any $s \\in \\mathcal{S}$, we have $L(\\pi^*) = \\mathbb{E}_{\\nu^*}[V^{\\pi^*}(s)] \\ge \\mathbb{E}_{\\nu^*}[V^\\pi(s)] = L(\\pi)$ for any $\\pi$. As a result, we have\n\n$$\\mathbb{E}_{\\nu^*}[\\langle Q^\\pi(s, \\cdot), \\pi(\\cdot | s) - \\pi^*(\\cdot | s) \\rangle] \\le 0, \\quad \\text{for any } \\pi. \\tag{5.1}$$\n\nUnder the variational inequality framework [14], (5.1) corresponds to the monotonicity of the mapping $Q^\\pi$ evaluated at $\\pi^*$ and any $\\pi$. Note that the classical notion of monotonicity requires the evaluation at any pair $\\pi'$ and $\\pi$, while we restrict $\\pi'$ to $\\pi^*$ in (5.1). Hence, we refer to (5.1) as one-point monotonicity. In the context of nonconvex optimization, the mapping $Q^\\pi$ can be viewed as the gradient of $L(\\pi)$ at $\\pi$, which lives in the dual space, while $\\pi$ lives in the primal space. Another condition related to (5.1) in nonconvex optimization is known as dissipativity [53].\n\nThe following lemma establishes the one-step descent of the KL-divergence in the infinite-dimensional policy space, which follows from the analysis of mirror descent [31, 32] as well as the fact that given any $\\nu_k$, the subproblem of policy improvement in (4.1) can be solved for each $s \\in \\mathcal{S}$ individually.\n\nLemma 5.2 (One-Step Descent). 
For the ideal improved policy $\\pi_{k+1}$ defined in (4.1) and the current policy $\\pi_{\\theta_k}$, we have that, for any $s \\in \\mathcal{S}$,\n\n$$\\begin{aligned} &\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_{k+1}}(\\cdot | s)) - \\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_k}(\\cdot | s)) \\\\ &\\quad \\le \\langle \\log(\\pi_{\\theta_{k+1}}(\\cdot | s) / \\pi_{k+1}(\\cdot | s)), \\pi_{\\theta_k}(\\cdot | s) - \\pi^*(\\cdot | s) \\rangle - \\beta_k^{-1} \\cdot \\langle Q^{\\pi_{\\theta_k}}(s, \\cdot), \\pi^*(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s) \\rangle \\\\ &\\qquad - 1/2 \\cdot \\|\\pi_{\\theta_{k+1}}(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s)\\|_1^2 - \\langle \\tau_{k+1}^{-1} f_{\\theta_{k+1}}(s, \\cdot) - \\tau_k^{-1} f_{\\theta_k}(s, \\cdot), \\pi_{\\theta_k}(\\cdot | s) - \\pi_{\\theta_{k+1}}(\\cdot | s) \\rangle. \\end{aligned}$$\n\nProof. See Appendix G for a detailed proof.\n\nBased on Lemmas 5.1 and 5.2, we prove Theorem 4.9 by casting neural PPO as infinite-dimensional mirror descent with primal and dual errors, whose impact is characterized in Lemma 4.7. In particular, we employ the $\\ell_1$-$\\ell_\\infty$ pair of primal-dual norms.\n\nProof of Theorem 4.9. Taking expectation with respect to $s \\sim \\nu^*$ and invoking Lemmas 4.7 and 5.2, we have\n\n$$\\begin{aligned} &\\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_{k+1}}(\\cdot | s))] - \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_k}(\\cdot | s))] \\\\ &\\quad \\le \\varepsilon_k - \\beta_k^{-1} \\cdot \\mathbb{E}_{\\nu^*}[\\langle Q^{\\pi_{\\theta_k}}(s, \\cdot), \\pi^*(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s) \\rangle] - 1/2 \\cdot \\mathbb{E}_{\\nu^*}[\\|\\pi_{\\theta_{k+1}}(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s)\\|_1^2] \\\\ &\\qquad - \\mathbb{E}_{\\nu^*}[\\langle \\tau_{k+1}^{-1} f_{\\theta_{k+1}}(s, \\cdot) - \\tau_k^{-1} f_{\\theta_k}(s, \\cdot), \\pi_{\\theta_k}(\\cdot | s) - \\pi_{\\theta_{k+1}}(\\cdot | s) \\rangle]. \\end{aligned}$$\n\nBy Lemma 5.1 and H\u00f6lder's inequality, we further have\n\n$$\\begin{aligned} &\\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_{k+1}}(\\cdot | s))] - \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_k}(\\cdot | s))] \\\\ &\\quad \\le \\varepsilon_k - (1-\\gamma)\\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) - 1/2 \\cdot \\mathbb{E}_{\\nu^*}[\\|\\pi_{\\theta_{k+1}}(\\cdot | s) - \\pi_{\\theta_k}(\\cdot | s)\\|_1^2] \\\\ &\\qquad + \\mathbb{E}_{\\nu^*}[\\|\\tau_{k+1}^{-1} f_{\\theta_{k+1}}(s, \\cdot) - \\tau_k^{-1} f_{\\theta_k}(s, \\cdot)\\|_\\infty \\cdot \\|\\pi_{\\theta_k}(\\cdot | s) - \\pi_{\\theta_{k+1}}(\\cdot | s)\\|_1] \\\\ &\\quad \\le \\varepsilon_k - (1-\\gamma)\\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) + 1/2 \\cdot \\mathbb{E}_{\\nu^*}[\\|\\tau_{k+1}^{-1} f_{\\theta_{k+1}}(s, \\cdot) - \\tau_k^{-1} f_{\\theta_k}(s, \\cdot)\\|_\\infty^2] \\\\ &\\quad \\le \\varepsilon_k - (1-\\gamma)\\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) + (\\varepsilon'_k + \\beta_k^{-2} M), \\end{aligned} \\tag{5.2}$$\n\nwhere in the second inequality we use $2xy - y^2 \\le x^2$ and in the last inequality we use Lemma 4.8. Rearranging the terms in (5.2), we have\n\n$$(1-\\gamma)\\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) \\le \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_k}(\\cdot | s))] - \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_{k+1}}(\\cdot | s))] + \\beta_k^{-2} M + \\varepsilon_k + \\varepsilon'_k. \\tag{5.3}$$\n\nTelescoping (5.3) for $k + 1 \\in [K]$, we obtain\n\n$$\\sum_{k=0}^{K-1} (1-\\gamma)\\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) \\le \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_0}(\\cdot | s))] - \\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_K}(\\cdot | s))] + M \\sum_{k=0}^{K-1} \\beta_k^{-2} + \\sum_{k=0}^{K-1} (\\varepsilon_k + \\varepsilon'_k).$$\n\nNote that we have (i) $\\sum_{k=0}^{K-1} \\beta_k^{-1} \\cdot (L(\\pi^*) - L(\\pi_{\\theta_k})) \\ge (\\sum_{k=0}^{K-1} \\beta_k^{-1}) \\cdot \\min_{0 \\le k \\le K} \\{L(\\pi^*) - L(\\pi_{\\theta_k})\\}$, (ii) $\\mathbb{E}_{\\nu^*}[\\mathrm{KL}(\\pi^*(\\cdot | s) \\,\\|\\, \\pi_{\\theta_0}(\\cdot | s))] \\le \\log|\\mathcal{A}|$ due to the uniform initialization of policy, and that (iii) the KL-divergence is nonnegative. Hence, we have\n\n$$\\min_{0 \\le k \\le K} \\{L(\\pi^*) - L(\\pi_{\\theta_k})\\} \\le \\frac{\\log|\\mathcal{A}| + M \\sum_{k=0}^{K-1} \\beta_k^{-2} + \\sum_{k=0}^{K-1} (\\varepsilon_k + \\varepsilon'_k)}{(1-\\gamma) \\sum_{k=0}^{K-1} \\beta_k^{-1}}. \\tag{5.4}$$\n\nSetting the penalty parameter $\\beta_k = \\beta\\sqrt{K}$, we have $\\sum_{k=0}^{K-1} \\beta_k^{-1} = \\beta^{-1}\\sqrt{K}$ and $\\sum_{k=0}^{K-1} \\beta_k^{-2} = \\beta^{-2}$, which together with (5.4) concludes the proof of Theorem 4.9.\n\nAcknowledgement\n\nThe authors thank Jason D. Lee, Chi Jin, and Yu Bai for enlightening discussions throughout this project.\n\nReferences\n\n[1] Abdolmaleki, A., Springenberg, J. 
T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.\n\n[2] Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.\n\n[3] Allen-Zhu, Z., Li, Y. and Liang, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.\n\n[4] Antos, A., Szepesv\u00e1ri, C. and Munos, R. (2008). Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems.\n\n[5] Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.\n\n[6] Azar, M. G., G\u00f3mez, V. and Kappen, H. J. (2012). Dynamic policy programming. Journal of Machine Learning Research, 13 3207\u20133245.\n\n[7] Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. arXiv preprint arXiv:1905.10027.\n\n[8] Cao, Y. and Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210.\n\n[9] Cao, Y. and Gu, Q. (2019). A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384.\n\n[10] Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.\n\n[11] Cho, W. S. and Wang, M. (2017). Deep primal-dual reinforcement learning: Accelerating actor-critic using Bellman duality. arXiv preprint arXiv:1712.02467.\n\n[12] Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J. and Song, L. (2017). 
SBEED: Convergent reinforcement learning with nonlinear function approximation. arXiv preprint arXiv:1712.10285.\n\n[13] Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.\n\n[14] Facchinei, F. and Pang, J.-S. (2007). Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.\n\n[15] Farahmand, A.-m., Ghavamzadeh, M., Szepesv\u00e1ri, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17 4809\u20134874.\n\n[16] Farahmand, A.-m., Szepesv\u00e1ri, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.\n\n[17] Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.\n\n[18] Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.\n\n[19] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. and Meger, D. (2018). Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence.\n\n[20] Hofmann, T., Sch\u00f6lkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. Annals of Statistics 1171\u20131220.\n\n[21] Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L. and Madry, A. (2018). Are deep policy gradient algorithms truly policy gradient algorithms? arXiv preprint arXiv:1811.02553.\n\n[22] Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. 
In Advances in Neural Information Processing Systems.\n\n[23] Kakade, S. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.\n\n[24] Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.\n\n[25] Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.\n\n[26] Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.\n\n[27] Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems.\n\n[28] Ling, Y., Hasan, S. A., Datla, V., Qadir, A., Lee, K., Liu, J. and Farri, O. (2017). Diagnostic inferencing via improving clinical concept extraction with deep reinforcement learning: A preliminary study. In Machine Learning for Healthcare Conference.\n\n[29] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.\n\n[30] Munos, R. and Szepesv\u00e1ri, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 815\u2013857.\n\n[31] Nemirovski, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Springer.\n\n[32] Nesterov, Y. (2013). Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer Science & Business Media.\n\n[33] Neu, G., Jonsson, A. and G\u00f3mez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.\n\n[34] OpenAI (2019). OpenAI Five. 
https://openai.com/five/.\n\n[35] Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180\u20131190.\n\n[36] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.\n\n[37] Rajeswaran, A., Lowrey, K., Todorov, E. V. and Kakade, S. M. (2017). Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems.\n\n[38] Sallab, A. E., Abdou, M., Perot, E. and Yogamani, S. (2017). Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017 70\u201376.\n\n[39] Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.\n\n[40] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.\n\n[41] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9\u201344.\n\n[42] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.\n\n[43] Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.\n\n[44] Szepesv\u00e1ri, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4 1\u2013103.\n\n[45] Tosatto, S., Pirotta, M., D\u2019Eramo, C. and Restelli, M. (2017). Boosted fitted Q-iteration. In International Conference on Machine Learning.\n\n[46] Wagner, P. (2011). A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In Advances in Neural Information Processing Systems.\n\n[47] Wagner, P. (2013). 
Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result. In Advances in Neural Information Processing Systems.\n\n[48] Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.\n\n[49] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229\u2013256.\n\n[50] Yang, L. F. and Wang, M. (2019). Sample-optimal parametric Q-learning with linear transition models. arXiv preprint arXiv:1902.04779.\n\n[51] Yang, Z., Xie, Y. and Wang, Z. (2019). A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.\n\n[52] Zhang, K., Koppel, A., Zhu, H. and Ba\u015far, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.\n\n[53] Zhou, M., Liu, T., Li, Y., Lin, D., Zhou, E. and Zhao, T. (2019). Toward understanding the importance of noise in training neural networks. In International Conference on Machine Learning.\n\n[54] Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.\n", "award": [], "sourceid": 5585, "authors": [{"given_name": "Boyi", "family_name": "Liu", "institution": "Northwestern University"}, {"given_name": "Qi", "family_name": "Cai", "institution": "Northwestern University"}, {"given_name": "Zhuoran", "family_name": "Yang", "institution": "Princeton University"}, {"given_name": "Zhaoran", "family_name": "Wang", "institution": "Northwestern University"}]}