{"title": "Neural Temporal-Difference Learning Converges to Global Optima", "book": "Advances in Neural Information Processing Systems", "page_first": 11315, "page_last": 11326, "abstract": "Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.", "full_text": "Neural Temporal-Difference Learning\n\nConverges to Global Optima\n\nQi Cai \u21e4\n\nZhuoran Yang \u2020\n\nJason D. Lee \u2021\n\nZhaoran Wang \u21e4\n\nAbstract\n\nTemporal-difference learning (TD), coupled with neural networks, is among the\nmost fundamental building blocks of deep reinforcement learning. However, due\nto the nonlinearity in value function approximation, such a coupling leads to non-\nconvexity and even divergence in optimization. As a result, the global convergence\nof neural TD remains unclear. In this paper, we prove for the \ufb01rst time that neural\nTD converges at a sublinear rate to the global optimum of the mean-squared pro-\njected Bellman error for policy evaluation. 
In particular, we show how such global\nconvergence is enabled by the overparametrization of neural networks, which also\nplays a vital role in the empirical success of neural TD.\u00b9\n\n1 Introduction\n\nGiven a policy, temporal-difference learning (TD) [49] aims to learn the corresponding (action-)value\nfunction by following the semigradients of the mean-squared Bellman error in an online\nmanner. As the most widely used policy evaluation algorithm, TD serves as the \u201ccritic\u201d component of many\nreinforcement learning algorithms, such as the actor-critic algorithm [31] and trust-region policy\noptimization [47]. In particular, in deep reinforcement learning, TD is often applied to learn value\nfunctions parametrized by neural networks [36, 39, 24], which gives rise to neural TD. As policy\nimprovement relies crucially on policy evaluation, the optimization efficiency and statistical accuracy\nof neural TD are critical to the performance of deep reinforcement learning. Towards theoretically\nunderstanding deep reinforcement learning, the goal of this paper is to characterize the convergence\nof neural TD.\nDespite the broad applications of neural TD, its convergence remains poorly understood. Even\nwith linear value function approximation, the nonasymptotic convergence of TD remained open until\nrecently [6, 33, 14, 48, 45], although its asymptotic convergence is well understood [28, 55, 9, 32,\n8]. Meanwhile, with nonlinear value function approximation, TD is known to diverge in general\n[4, 11, 55]. To remedy this issue, [7] propose nonlinear (gradient) TD, which uses the tangent\nvectors of nonlinear value functions in place of the feature vectors in linear TD. Unlike linear TD,\nwhich converges to the global optimum of the mean-squared projected Bellman error (MSPBE),\nnonlinear TD is only guaranteed to converge to a local optimum asymptotically. As a result, the\nstatistical accuracy of the value function learned by nonlinear TD remains unclear. 
In contrast to such\nconservative theory, neural TD, which straightforwardly combines TD with neural networks without\nthe explicit local linearization in nonlinear TD, often learns a desired value function that generalizes\nwell to unseen states in practice [18, 2, 26]. Hence, a gap separates theory from practice.\n\n\u21e4Department of Industrial Engineering and Management Sciences, Northwestern University\n\u2020Department of Operations Research and Financial Engineering, Princeton University\n\u2021Department of Electronic Engineering, Princeton University\n1Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further\nconnected to that of policy gradient algorithms. See https://arxiv.org/abs/1905.10027 for the full\nversion.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThere exist three obstacles towards closing such a theory-practice gap: (i) MSPBE has an expectation\nover the transition dynamics within the squared loss, which forbids the construction of unbiased\nstochastic gradients [50]. As a result, even with linear value function approximation, TD largely\neludes the classical optimization framework, as it follows biased stochastic semigradients. (ii) When\nthe value function is parametrized by a neural network, MSPBE is nonconvex in the weights of the\nneural network, which may introduce undesired stationary points such as local optima and saddle\npoints [30]. As a result, even an ideal algorithm that follows the population gradients of MSPBE may\nget trapped. (iii) Due to the interplay between the bias in stochastic semigradients and the nonlinearity\nin value function approximation, neural TD may even diverge [4, 11, 55], instead of converging to\nan undesired stationary point, as it lacks the explicit local linearization in nonlinear TD [7]. Such\ndivergence is also not captured by the classical optimization framework.\n\nContribution. 
Towards bridging theory and practice, we establish the first nonasymptotic global rate\nof convergence of neural TD. In detail, we prove that randomly initialized neural TD converges to the\nglobal optimum of MSPBE at the rate of $1/T$ with population semigradients and at the rate of $1/\\sqrt{T}$\nwith stochastic semigradients. Here $T$ is the number of iterations and the (action-)value function is\nparametrized by a sufficiently wide two-layer neural network. Moreover, we prove that the projection\nin MSPBE allows for a sufficiently rich class of functions, which has the same representation power\nas a reproducing kernel Hilbert space associated with the random initialization. As a result, for a\nbroad class of reinforcement learning problems, neural TD attains zero MSPBE.\nAt the core of our analysis is the overparametrization of the two-layer neural network for value\nfunction approximation [59, 41, 1, 3], which enables us to circumvent the three obstacles above. In\nparticular, overparametrization leads to an implicit local linearization that varies smoothly along the\nsolution path, which mirrors the explicit one in nonlinear TD [7]. Such an implicit local linearization\nenables us to circumvent the third obstacle of possible divergence. Moreover, overparametrization\nallows us to establish a notion of one-point monotonicity [25, 19] for the semigradients followed by\nneural TD, which ensures its evolution towards the global optimum of MSPBE along the solution\npath. Such a notion of monotonicity enables us to circumvent the first and second obstacles of bias\nand nonconvexity. Broadly speaking, our theory backs the empirical success of overparametrized\nneural networks in deep reinforcement learning. In particular, we show that instead of being a curse,\noverparametrization is indeed a blessing for minimizing MSPBE in the presence of bias, nonconvexity,\nand even divergence.\n\nMore Related Work. 
There is a large body of literature on the convergence of linear TD under\nboth asymptotic [28, 55, 9, 32, 8] and nonasymptotic [6, 33, 14, 48] regimes. See [16] for a detailed\nsurvey. In particular, our analysis is based on the recent breakthrough in the nonasymptotic analysis\nof linear TD [6] and its extension to linear Q-learning [60]. An essential step of our analysis is\nbridging the evolution of linear TD and neural TD through the implicit local linearization induced by\noverparametrization.\nTo incorporate nonlinear value function approximation into TD, [7] propose the first convergent\nnonlinear TD based on explicit local linearization, which however only converges to a local optimum\nof MSPBE. See [21, 5] for a detailed survey. In contrast, we prove that, with the implicit local\nlinearization induced by overparametrization, neural TD, which is simpler to implement and more\nwidely used in deep reinforcement learning than nonlinear TD, provably converges to the global\noptimum of MSPBE.\nThere exist various extensions of TD, including least-squares TD [12, 10, 34, 22, 56] and gradient\nTD [51, 52, 7, 37, 17, 57, 54]. In detail, least-squares TD is based on batch update, which loses the\ncomputational and statistical efficiency of the online update in TD. Meanwhile, gradient TD follows\nunbiased stochastic gradients, but at the cost of introducing another optimization variable. Such a\nreformulation leads to bilevel optimization, which is less stable in practice when combined with\nneural networks [42]. As a result, both extensions of TD are less widely used in deep reinforcement\nlearning [18, 2, 26]. Moreover, when using neural networks for value function approximation, the\nconvergence to the global optimum of MSPBE remains unclear for both extensions of TD.\nOur work is also related to the recent breakthrough in understanding overparametrized neural\nnetworks, especially their generalization error [59, 41, 1, 3]. 
See [20] for a detailed survey. In\nparticular, [15, 1, 3, 13, 29, 35] characterize the implicit local linearization in the context of supervised\nlearning, where we train an overparametrized neural network by following the stochastic gradients\nof the mean-squared error. In contrast, neural TD does not follow the stochastic gradients of any\nobjective function, hence leading to possible divergence, which makes the convergence analysis more\nchallenging.\n\n2 Background\n\nIn Section 2.1, we briefly review policy evaluation in reinforcement learning. In Section 2.2, we\nintroduce the corresponding optimization formulations.\n\n2.1 Policy Evaluation\nWe consider a Markov decision process $(\\mathcal{S}, \\mathcal{A}, \\mathcal{P}, r, \\gamma)$, in which an agent interacts with the environment\nto learn the optimal policy that maximizes the expected total reward. At the $t$-th time step, the\nagent has a state $s_t \\in \\mathcal{S}$ and takes an action $a_t \\in \\mathcal{A}$. Upon taking the action, the agent enters the\nnext state $s_{t+1} \\in \\mathcal{S}$ according to the transition probability $\\mathcal{P}(\\cdot \\mid s_t, a_t)$ and receives a random reward\n$r_t = r(s_t, a_t)$ from the environment. The action that the agent takes at each state is decided by a\npolicy $\\pi : \\mathcal{S} \\to \\Delta$, where $\\Delta$ is the set of all probability distributions over $\\mathcal{A}$. The performance of\npolicy $\\pi$ is measured by the expected total reward, $J(\\pi) = \\mathbb{E}[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid a_t \\sim \\pi(s_t)]$, where $\\gamma < 1$\nis the discount factor.\nGiven policy $\\pi$, policy evaluation aims to learn the following two functions, the value function\n$V^\\pi(s) = \\mathbb{E}[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid s_0 = s, a_t \\sim \\pi(s_t)]$ and the action-value function (Q-function) $Q^\\pi(s, a) =\n\\mathbb{E}[\\sum_{t=0}^{\\infty} \\gamma^t r_t \\mid s_0 = s, a_0 = a, a_t \\sim \\pi(s_t)]$. Both functions form the basis for policy improvement.\n\nWithout loss of generality, we focus on learning the Q-function in this paper. 
We define the Bellman\nevaluation operator,\n\n$$\\mathcal{T}^\\pi Q(s, a) = \\mathbb{E}[r(s, a) + \\gamma Q(s', a') \\mid s' \\sim \\mathcal{P}(\\cdot \\mid s, a), a' \\sim \\pi(s')], \\tag{2.1}$$\n\nfor which $Q^\\pi$ is the fixed point, that is, the solution to the Bellman equation $Q = \\mathcal{T}^\\pi Q$.\n\n2.2 Optimization Formulation\nCorresponding to (2.1), we aim to learn $Q^\\pi$ by minimizing the mean-squared Bellman error (MSBE),\n\n$$\\min_\\theta \\mathrm{MSBE}(\\theta) = \\mathbb{E}_{(s,a) \\sim \\mu}[(\\widehat{Q}_\\theta(s, a) - \\mathcal{T}^\\pi \\widehat{Q}_\\theta(s, a))^2], \\tag{2.2}$$\n\nwhere the Q-function is parametrized as $\\widehat{Q}_\\theta$ with parameter $\\theta$. Here $\\mu$ is the stationary distribution\nof $(s, a)$ corresponding to policy $\\pi$. Due to Q-function approximation, we focus on minimizing the\nfollowing surrogate of MSBE, namely the mean-squared projected Bellman error (MSPBE),\n\n$$\\min_\\theta \\mathrm{MSPBE}(\\theta) = \\mathbb{E}_{(s,a) \\sim \\mu}[(\\widehat{Q}_\\theta(s, a) - \\Pi_{\\mathcal{F}} \\mathcal{T}^\\pi \\widehat{Q}_\\theta(s, a))^2]. \\tag{2.3}$$\n\nHere $\\Pi_{\\mathcal{F}}$ is the projection onto a function class $\\mathcal{F}$. For example, for linear Q-function approximation\n[49], $\\mathcal{F}$ takes the form $\\{\\widehat{Q}_{\\theta'} : \\theta' \\in \\Theta\\}$, where $\\widehat{Q}_{\\theta'}$ is linear in $\\theta'$ and $\\Theta$ is the set of feasible\nparameters. As another example, for nonlinear Q-function approximation [7], $\\mathcal{F}$ takes the form\n$\\{\\widehat{Q}_\\theta + \\nabla_\\theta \\widehat{Q}_\\theta^\\top (\\theta' - \\theta) : \\theta' \\in \\Theta\\}$, which consists of the local linearization of $\\widehat{Q}_{\\theta'}$ at $\\theta$.\n\nThroughout this paper, we assume that we are able to sample tuples in the form of $(s, a, r, s', a')$\nfrom the stationary distribution of policy $\\pi$ in an independent and identically distributed manner,\nalthough our analysis can be extended to handle temporal dependence using the proof techniques of\n[6]. 
With a slight abuse of notation, we use $\\mu$ to denote the stationary distribution of $(s, a, r, s', a')$\ncorresponding to policy $\\pi$ and any of its marginal distributions.\n\n3 Neural Temporal-Difference Learning\n\nTD updates the parameter $\\theta$ of the Q-function by taking the stochastic semigradient descent step\n[49, 53, 50],\n\n$$\\theta' \\leftarrow \\theta - \\eta \\cdot (\\widehat{Q}_\\theta(s, a) - r(s, a) - \\gamma \\widehat{Q}_\\theta(s', a')) \\cdot \\nabla_\\theta \\widehat{Q}_\\theta(s, a), \\tag{3.1}$$\n\nwhich corresponds to the MSBE in (2.2). Here $(s, a, r, s', a') \\sim \\mu$ and $\\eta > 0$ is the stepsize. In\na more general context, (3.1) is referred to as TD(0). In this paper, we focus on TD(0), which is\nabbreviated as TD, and leave the extension to TD($\\lambda$) to future work.\nIn the sequel, we denote the state-action pair $(s, a) \\in \\mathcal{S} \\times \\mathcal{A}$ by a vector $x \\in \\mathcal{X} \\subseteq \\mathbb{R}^d$ with $d > 2$. We\nconsider $\\mathcal{S}$ to be continuous and $\\mathcal{A}$ to be finite. Without loss of generality, we assume that $\\|x\\|_2 = 1$\nand $|r(x)|$ is upper bounded by a constant $\\overline{r}$ for any $x \\in \\mathcal{X}$. We use a two-layer neural network\n\n$$\\widehat{Q}(x; W) = \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r \\sigma(W_r^\\top x) \\tag{3.2}$$\n\nto parametrize the Q-function. Here $\\sigma$ is the rectified linear unit (ReLU) activation function $\\sigma(y) =\n\\max\\{0, y\\}$ and the parameters $\\theta = (b_1, \\ldots, b_m, W_1, \\ldots, W_m)$ are initialized as $b_r \\sim \\mathrm{Unif}(\\{-1, 1\\})$\nand $W_r \\sim N(0, I_d/d)$ for any $r \\in [m]$ independently. During training, we only update $W =\n(W_1, \\ldots, W_m) \\in \\mathbb{R}^{md}$, while keeping $b = (b_1, \\ldots, b_m) \\in \\mathbb{R}^m$ fixed as the random initialization.\nTo ensure global convergence, we incorporate an additional projection step with respect to $W$. 
See\nAlgorithm 1 for a detailed description.\n\nAlgorithm 1 Neural TD\n1: Initialization: $b_r \\sim \\mathrm{Unif}(\\{-1, 1\\})$, $W_r(0) \\sim N(0, I_d/d)$ ($r \\in [m]$), $\\overline{W} = W(0)$, $S_B = \\{W \\in \\mathbb{R}^{md} : \\|W - W(0)\\|_2 \\le B\\}$ ($B > 0$)\n2: For $t = 0$ to $T - 2$:\n3:   Sample a tuple $(s, a, r, s', a')$ from the stationary distribution $\\mu$ of policy $\\pi$\n4:   Let $x = (s, a)$, $x' = (s', a')$\n5:   Bellman residual calculation: $\\delta \\leftarrow \\widehat{Q}(x; W(t)) - r - \\gamma \\widehat{Q}(x'; W(t))$\n6:   TD update: $\\widetilde{W}(t+1) \\leftarrow W(t) - \\eta \\delta \\cdot \\nabla_W \\widehat{Q}(x; W(t))$\n7:   Projection: $W(t+1) \\leftarrow \\mathrm{argmin}_{W \\in S_B} \\|W - \\widetilde{W}(t+1)\\|_2$\n8:   Averaging: $\\overline{W} \\leftarrow \\frac{t+1}{t+2} \\cdot \\overline{W} + \\frac{1}{t+2} \\cdot W(t+1)$\n9: End For\n10: Output: $\\widehat{Q}_{\\mathrm{out}}(\\cdot) \\leftarrow \\widehat{Q}(\\cdot; \\overline{W})$\n\nTo understand the intuition behind the global convergence of neural TD, note that for the TD update\nin (3.1), we have from (2.1) that\n\n$$\\mathbb{E}_{(s,a,r,s',a') \\sim \\mu}[(\\widehat{Q}_\\theta(s, a) - r(s, a) - \\gamma \\widehat{Q}_\\theta(s', a')) \\cdot \\nabla_\\theta \\widehat{Q}_\\theta(s, a)] = \\mathbb{E}_{(s,a) \\sim \\mu}[(\\widehat{Q}_\\theta(s, a) - \\mathbb{E}[r(s, a) + \\gamma \\widehat{Q}_\\theta(s', a') \\mid s' \\sim \\mathcal{P}(\\cdot \\mid s, a), a' \\sim \\pi(s')]) \\cdot \\nabla_\\theta \\widehat{Q}_\\theta(s, a)] = \\mathbb{E}_{(s,a) \\sim \\mu}[\\underbrace{(\\widehat{Q}_\\theta(s, a) - \\mathcal{T}^\\pi \\widehat{Q}_\\theta(s, a))}_{\\text{(i)}} \\cdot \\underbrace{\\nabla_\\theta \\widehat{Q}_\\theta(s, a)}_{\\text{(ii)}}]. \\tag{3.3}$$\n\nHere (i) is the Bellman residual at $(s, a)$, while (ii) is the gradient of the first term in (i). Although the\nTD update in (3.1) resembles the stochastic gradient descent step for minimizing a mean-squared\nerror, it is not an unbiased stochastic gradient of any objective function. However, we show that the\nTD update yields a descent direction towards the global optimum of the MSPBE in (2.3). Moreover,\nas the neural network becomes wider, the function class $\\mathcal{F}$ that $\\Pi_{\\mathcal{F}}$ projects onto in (2.3) becomes\nricher. Correspondingly, the MSPBE reduces to the MSBE in (2.2) as the projection becomes closer\nto identity, which implies the recovery of the desired Q-function $Q^\\pi$ such that $Q^\\pi = \\mathcal{T}^\\pi Q^\\pi$. 
See\nSection 4 for a more rigorous characterization.\n\n4 Main Results\n\nIn Section 4.1, we characterize the global optimality of the stationary point attained by Algorithm 1\nin terms of minimizing the MSPBE in (2.3) and its other properties. In Section 4.2, we establish the\nnonasymptotic global rates of convergence of neural TD to the global optimum of the MSPBE when\nfollowing the population semigradients in (3.3) and the stochastic semigradients in (3.1), respectively.\nWe use $\\mathbb{E}_\\mu[\\cdot]$ to denote the expectation over the randomness of the tuple $(s, a, r, s', a')$\n(or its concise form $(x, r, x')$) conditional on all other randomness, e.g., the random initialization\nand the random current iterate. Meanwhile, we use $\\mathbb{E}_{\\mathrm{init},\\mu}[\\cdot]$ when we are taking the\nexpectation over all randomness, including the random initialization.\n\n4.1 Properties of Stationary Point\nWe consider the population version of the TD update in Line 6 of Algorithm 1,\n\n$$\\widetilde{W}(t+1) \\leftarrow W(t) - \\eta \\cdot \\mathbb{E}_\\mu[\\delta(x, r, x'; W(t)) \\cdot \\nabla_W \\widehat{Q}(x; W(t))], \\tag{4.1}$$\n\nwhere $\\mu$ is the stationary distribution and $\\delta(x, r, x'; W(t)) = \\widehat{Q}(x; W(t)) - r - \\gamma \\widehat{Q}(x'; W(t))$ is the\nBellman residual at $(x, r, x')$. The stationary point $W^\\dagger$ of (4.1) satisfies the following stationarity\ncondition,\n\n$$\\mathbb{E}_\\mu[\\delta(x, r, x'; W^\\dagger) \\cdot \\nabla_W \\widehat{Q}(x; W^\\dagger)]^\\top (W - W^\\dagger) \\ge 0, \\quad \\text{for any } W \\in S_B. \\tag{4.2}$$\n\nAlso, note that\n\n$$\\widehat{Q}(x; W) = \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r \\sigma(W_r^\\top x) = \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r 1\\{W_r^\\top x > 0\\} W_r^\\top x$$\n\nand $\\nabla_{W_r} \\widehat{Q}(x; W) = b_r 1\\{W_r^\\top x > 0\\} x$ almost everywhere in $\\mathbb{R}^{md}$. Meanwhile, recall that $S_B =\n\\{W \\in \\mathbb{R}^{md} : \\|W - W(0)\\|_2 \\le B\\}$. 
We define the function class\n\n$$\\mathcal{F}^\\dagger_{B,m} = \\Big\\{ \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r 1\\{(W_r^\\dagger)^\\top x > 0\\} W_r^\\top x : W \\in S_B \\Big\\}, \\tag{4.3}$$\n\nwhich consists of the local linearization of $\\widehat{Q}(x; W)$ at $W = W^\\dagger$. Then (4.2) takes the following\nequivalent form\n\n$$\\langle \\widehat{Q}(\\cdot; W^\\dagger) - \\mathcal{T}^\\pi \\widehat{Q}(\\cdot; W^\\dagger), f(\\cdot) - \\widehat{Q}(\\cdot; W^\\dagger) \\rangle_\\mu \\ge 0, \\quad \\text{for any } f \\in \\mathcal{F}^\\dagger_{B,m}, \\tag{4.4}$$\n\nwhich implies $\\widehat{Q}(\\cdot; W^\\dagger) = \\Pi_{\\mathcal{F}^\\dagger_{B,m}} \\mathcal{T}^\\pi \\widehat{Q}(\\cdot; W^\\dagger)$ by the definition of the projection induced by $\\langle \\cdot, \\cdot \\rangle_\\mu$.\nBy (2.3), $\\widehat{Q}(\\cdot; W^\\dagger)$ is the global optimum of the MSPBE that corresponds to the projection onto\n$\\mathcal{F}^\\dagger_{B,m}$.\nIntuitively, when using an overparametrized neural network with width $m \\to \\infty$, the average\nvariation in each $W_r$ diminishes to zero. Hence, roughly speaking, we have $1\\{W_r(t)^\\top x > 0\\} =\n1\\{W_r(0)^\\top x > 0\\}$ with high probability for any $t \\in [T]$. As a result, the function class $\\mathcal{F}^\\dagger_{B,m}$ defined\nin (4.3) approximates\n\n$$\\mathcal{F}_{B,m} = \\Big\\{ \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r 1\\{W_r(0)^\\top x > 0\\} W_r^\\top x : W \\in S_B \\Big\\}. \\tag{4.5}$$\n\nIn the sequel, we show that, to characterize the global convergence of Algorithm 1 with a sufficiently\nlarge $m$, it suffices to consider $\\mathcal{F}_{B,m}$ in place of $\\mathcal{F}^\\dagger_{B,m}$, which simplifies the analysis, since the\ndistribution of $W(0)$ is given. To this end, we define the approximate stationary point $W^*$ with\nrespect to the function class $\\mathcal{F}_{B,m}$ defined in (4.5).\nDefinition 4.1 (Approximate Stationary Point $W^*$). If $W^* = (W^*_1, \\ldots, W^*_m) \\in S_B$ satisfies\n\n$$\\mathbb{E}_\\mu[\\delta_0(x, r, x'; W^*) \\cdot \\nabla_W \\widehat{Q}_0(x; W^*)]^\\top (W - W^*) \\ge 0, \\quad \\text{for any } W \\in S_B, \\tag{4.6}$$\n\nwhere we define\n\n$$\\widehat{Q}_0(x; W) = \\frac{1}{\\sqrt{m}} \\sum_{r=1}^{m} b_r 1\\{W_r(0)^\\top x > 0\\} W_r^\\top x, \\tag{4.7}$$\n\n$$\\delta_0(x, r, x'; W) = \\widehat{Q}_0(x; W) - r - \\gamma \\widehat{Q}_0(x'; W), \\tag{4.8}$$\n\nthen we say that $W^*$ is an approximate stationary point of the population update in (4.1). Here $W^*$\ndepends on the random initialization $b = (b_1, \\ldots, b_m)$ and $W(0) = (W_1(0), \\ldots, W_m(0))$.\n\n
The next lemma proves that such an approximate stationary point always exists, since it corresponds\nto the fixed point of the operator $\\Pi_{\\mathcal{F}_{B,m}} \\mathcal{T}^\\pi$, which is a contraction in the $\\ell_2$-norm associated with\nthe stationary distribution $\\mu$.\nLemma 4.2 (Existence and Optimality of $W^*$). There exists an approximate stationary point $W^*$ for\nany $b \\in \\{-1, 1\\}^m$ and $W(0) \\in \\mathbb{R}^{md}$. Also, $\\widehat{Q}_0(\\cdot; W^*)$ is the global optimum of the MSPBE that\ncorresponds to the projection onto $\\mathcal{F}_{B,m}$ in (4.5).\nProof. See Appendix B.1 for a detailed proof.\n\n4.2 Global Convergence\nIn this section, we establish the main results on the global convergence of neural TD in Algorithm 1.\nWe first lay out the following regularity condition on the stationary distribution $\\mu$.\nAssumption 4.3 (Regularity of Stationary Distribution $\\mu$). There exists a constant $c_0 > 0$ such that\nfor any $\\tau \\ge 0$ and $w \\sim N(0, I_d/d)$, it holds almost surely that\n\n$$\\mathbb{E}_\\mu[1\\{|w^\\top x| \\le \\tau\\} \\mid w] \\le c_0 \\cdot \\tau / \\|w\\|_2. \\tag{4.9}$$\n\nAssumption 4.3 regularizes the density of $\\mu$ in terms of the marginal distribution of $x$. In particular, it\nis straightforwardly implied when the density of $\\mu$ in terms of state $s$ is upper bounded.\n\nPopulation Update: The next theorem establishes the nonasymptotic global rate of convergence of\nneural TD when it follows population semigradients. Recall that the approximate stationary point $W^*$\nand $\\widehat{Q}_0(\\cdot; W^*)$ are defined in Definition 4.1. Also, $B$ is the radius of the set of feasible $W$, which is\ndefined in Algorithm 1, $T$ is the number of iterations, $\\gamma$ is the discount factor, and $m$ is the width of\nthe neural network in (3.2).\nTheorem 4.4 (Convergence of Population Update). 
We set $\\eta = (1 - \\gamma)/8$ in Algorithm 1 and replace\nthe TD update in Line 6 by the population update in (4.1). Under Assumption 4.3, the output $\\widehat{Q}_{\\mathrm{out}}$ of\nAlgorithm 1 satisfies\n\n$$\\mathbb{E}_{\\mathrm{init},\\mu}[(\\widehat{Q}_{\\mathrm{out}}(x) - \\widehat{Q}_0(x; W^*))^2] \\le \\frac{16 B^2}{(1 - \\gamma)^2 T} + O(B^3 m^{-1/2} + B^{5/2} m^{-1/4}),$$\n\nwhere the expectation is taken with respect to all randomness, including the random initialization and\nthe stationary distribution $\\mu$.\n\nProof. The key to the proof of Theorem 4.4 is the one-point monotonicity of the population semigradient\n$\\overline{g}(t)$, which is established through the local linearization $\\widehat{Q}_0(x; W)$ of $\\widehat{Q}(x; W)$. See Appendix\nC.5 for a detailed proof.\n\nStochastic Update: To further prove the global convergence of neural TD when it follows stochastic\nsemigradients, we first establish an upper bound of their variance, which affects the choice of the\nstepsize $\\eta$. For notational simplicity, we define the stochastic and population semigradients as\n\n$$g(t) = \\delta(x, r, x'; W(t)) \\cdot \\nabla_W \\widehat{Q}(x; W(t)), \\qquad \\overline{g}(t) = \\mathbb{E}_\\mu[g(t)]. \\tag{4.10}$$\n\nLemma 4.5 (Variance Bound). There exists $\\sigma_g^2 = O(B^2)$ such that the variance of the stochastic\nsemigradient is upper bounded as $\\mathbb{E}_{\\mathrm{init},\\mu}[\\|g(t) - \\overline{g}(t)\\|_2^2] \\le \\sigma_g^2$ for any $t \\in [T]$.\nProof. See Appendix B.2 for a detailed proof.\n\nBased on Theorem 4.4 and Lemma 4.5, we establish the global convergence of neural TD in Algorithm\n1.\n\nTheorem 4.6 (Convergence of Stochastic Update). We set $\\eta = \\min\\{(1 - \\gamma)/8, 1/\\sqrt{T}\\}$ in Algorithm\n1. Under Assumption 4.3, the output $\\widehat{Q}_{\\mathrm{out}}$ of Algorithm 1 satisfies\n\n$$\\mathbb{E}_{\\mathrm{init},\\mu}[(\\widehat{Q}_{\\mathrm{out}}(x) - \\widehat{Q}_0(x; W^*))^2] \\le \\frac{16 (B^2 + \\sigma_g^2)}{(1 - \\gamma)^2 \\sqrt{T}} + O(B^3 m^{-1/2} + B^{5/2} m^{-1/4}).$$\n\nProof. See Appendix C.6 for a detailed proof.\n\nAs the width of the neural network $m \\to \\infty$, Lemma 4.2 implies that $\\widehat{Q}_0(\\cdot; W^*)$ is the global\noptimum of the MSPBE in (2.3) with a richer function class $\\mathcal{F}_{B,\\infty}$ to project onto. 
In fact, the\nfunction class $\\mathcal{F}_{B,\\infty} - \\widehat{Q}(\\cdot; W(0))$ is a subset of an RKHS with $\\mathcal{H}$-norm upper bounded by $B$.\nHere $\\widehat{Q}(\\cdot; W(0))$ is defined in (3.2). See Appendix A.2 for a more detailed discussion on the\nrepresentation power of $\\mathcal{F}_{B,\\infty}$. Therefore, if the desired Q-function $Q^\\pi(\\cdot)$ falls into $\\mathcal{F}_{B,\\infty}$, it is the\nglobal optimum of the MSPBE. In such a case, by Lemma 4.2 and Theorem 4.6, we approximately\nobtain $Q^\\pi(\\cdot) = \\widehat{Q}_0(\\cdot; W^*)$ through $\\widehat{Q}_{\\mathrm{out}}(\\cdot)$.\nMore generally, the following proposition quantifies the distance between $\\widehat{Q}_0(\\cdot; W^*)$ and $Q^\\pi(\\cdot)$ in\nthe case that $Q^\\pi(\\cdot)$ does not fall into the function class $\\mathcal{F}_{B,m}$. In particular, it states that the $\\ell_2$-norm\ndistance $\\|\\widehat{Q}_0(\\cdot; W^*) - Q^\\pi(\\cdot)\\|_\\mu$ is upper bounded by the distance between $Q^\\pi(\\cdot)$ and $\\mathcal{F}_{B,m}$.\nProposition 4.7 (Convergence of Stochastic Update to $Q^\\pi$). It holds that $\\|\\widehat{Q}_0(\\cdot; W^*) - Q^\\pi(\\cdot)\\|_\\mu \\le\n(1 - \\gamma)^{-1} \\cdot \\|\\Pi_{\\mathcal{F}_{B,m}} Q^\\pi(\\cdot) - Q^\\pi(\\cdot)\\|_\\mu$, which by Theorem 4.6 implies\n\n$$\\mathbb{E}_{\\mathrm{init},\\mu}[(\\widehat{Q}_{\\mathrm{out}}(x) - Q^\\pi(x))^2] \\le \\frac{32 (B^2 + \\sigma_g^2)}{(1 - \\gamma)^2 \\sqrt{T}} + \\frac{2 \\mathbb{E}_{\\mathrm{init},\\mu}[(\\Pi_{\\mathcal{F}_{B,m}} Q^\\pi(x) - Q^\\pi(x))^2]}{(1 - \\gamma)^2} + O(B^3 m^{-1/2} + B^{5/2} m^{-1/4}).$$\n\nProof. See Appendix B.3 for a detailed proof.\n\nProposition 4.7 implies that if $Q^\\pi(\\cdot) \\in \\mathcal{F}_{B,\\infty}$, then $\\widehat{Q}_{\\mathrm{out}}(\\cdot) \\to Q^\\pi(\\cdot)$ as $T, m \\to \\infty$. In other words,\nneural TD converges to the global optimum of the MSPBE in (2.3), or equivalently, the MSBE in\n(2.2), both of which have objective value zero.\n\n5 Proof Sketch\n\nIn the sequel, we sketch the proofs of Theorems 4.4 and 4.6 in Section 4.\n\n5.1 Implicit Local Linearization via Overparametrization\n\nRecall that as defined in (4.7), $\\widehat{Q}_0(x; W)$ takes the form\n\n$$\\widehat{Q}_0(x; W) = \\Phi(x)^\\top W, \\quad \\text{where } \\Phi(x) = \\frac{1}{\\sqrt{m}} \\cdot (1\\{W_1(0)^\\top x > 0\\} x, \\ldots, 1\\{W_m(0)^\\top x > 0\\} x) \\in \\mathbb{R}^{md},$$\n\nwhich is linear in the feature map $\\Phi(x)$. 
In other words, with respect to $W$, $\\widehat{Q}_0(x; W)$ linearizes the\nneural network $\\widehat{Q}(x; W)$ defined in (3.2) locally at $W(0)$. The following lemma characterizes the\ndifference between $\\widehat{Q}(x; W(t))$, which is along the solution path of neural TD in Algorithm 1, and\nits local linearization $\\widehat{Q}_0(x; W(t))$. In particular, we show that the error of such a local linearization\ndiminishes to zero as $m \\to \\infty$. For notational simplicity, we use $\\widehat{Q}_t(x)$ to denote $\\widehat{Q}(x; W(t))$ in\nthe sequel. Note that by (4.7) we have $\\widehat{Q}_0(x) = \\widehat{Q}(x; W(0)) = \\widehat{Q}_0(x; W(0))$. Recall that $B$ is the\nradius of the set of feasible $W$ in (4.5).\nLemma 5.1 (Local Linearization of Q-Function). There exists a constant $c_1 > 0$ such that for any\n$t \\in [T]$, it holds that\n\n$$\\mathbb{E}_{\\mathrm{init},\\mu}[(\\widehat{Q}_t(x) - \\widehat{Q}_0(x; W(t)))^2] \\le 4 c_1 B^3 \\cdot m^{-1/2}.$$\n\nProof. See Appendix C.1 for a detailed proof.\n\nAs a direct consequence of Lemma 5.1, the next lemma characterizes the effect of local linearization\non population semigradients. Recall that $\\overline{g}(t)$ is defined in (4.10). We denote by $\\overline{g}_0(t)$ the locally\nlinearized population semigradient, which is defined by replacing $\\widehat{Q}_t(x)$ in $\\overline{g}(t)$ with its local\nlinearization $\\widehat{Q}_0(x; W(t))$. In other words, by (4.10), (4.7), and (4.8), we have\n\n$$\\overline{g}(t) = \\mathbb{E}_\\mu[\\delta(x, r, x'; W(t)) \\cdot \\nabla_W \\widehat{Q}(x; W(t))], \\tag{5.1}$$\n\n$$\\overline{g}_0(t) = \\mathbb{E}_\\mu[\\delta_0(x, r, x'; W(t)) \\cdot \\nabla_W \\widehat{Q}_0(x; W(t))]. \\tag{5.2}$$\n\nLemma 5.2 (Local Linearization of Semigradient). Let $\\overline{r}$ be the upper bound of the reward $r(x)$ for\nany $x \\in \\mathcal{X}$. There exists a constant $c_2 > 0$ such that for any $t \\in [T]$, it holds that\n\n$$\\mathbb{E}_{\\mathrm{init}}[\\|\\overline{g}(t) - \\overline{g}_0(t)\\|_2^2] \\le (56 c_1 B^3 + 24 c_2 B + 6 c_1 B \\overline{r}^2) \\cdot m^{-1/2}.$$\n\nProof. See Appendix C.2 for a detailed proof.\n\nLemmas 5.1 and 5.2 show that the error of local linearization diminishes as the degree of\noverparametrization increases, that is, as the width $m$ grows. As a result, we do not require the explicit local linearization in\nnonlinear TD [7]. 
Instead, we show that such an implicit local linearization suffices to ensure the\nglobal convergence of neural TD.\n\n5.2 Proofs for Population Update\n\nThe characterization of the locally linearized Q-function in Lemma 5.1 and the locally linearized\npopulation semigradients in Lemma 5.2 allows us to establish the following descent lemma, which\nextends Lemma 3 of [6] for characterizing linear TD.\nLemma 5.3 (Population Descent Lemma). For $\\{W(t)\\}_{t \\in [T]}$ in Algorithm 1 with the TD update in\nLine 6 replaced by the population update in (4.1), it holds that\n\n$$\\|W(t+1) - W^*\\|_2^2 \\le \\|W(t) - W^*\\|_2^2 - (2\\eta(1 - \\gamma) - 8\\eta^2) \\cdot \\mathbb{E}_\\mu[(\\widehat{Q}_0(x; W(t)) - \\widehat{Q}_0(x; W^*))^2] + \\underbrace{2\\eta^2 \\cdot \\|\\overline{g}(t) - \\overline{g}_0(t)\\|_2^2 + 2\\eta B \\cdot \\|\\overline{g}(t) - \\overline{g}_0(t)\\|_2}_{\\text{Error of Local Linearization}}.$$\n\nProof. See Appendix C.3 for a detailed proof.\n\nLemma 5.3 shows that, with a sufficiently small stepsize $\\eta$, $\\|W(t) - W^*\\|_2$ decays at each iteration\nup to the error of local linearization, which is characterized by Lemma 5.2. By combining Lemmas\n5.2 and 5.3 and further plugging them into a telescoping sum, we establish the convergence of $\\widehat{Q}_{\\mathrm{out}}(\\cdot)$\nto the global optimum $\\widehat{Q}_0(\\cdot; W^*)$ of the MSPBE. See Appendix C.5 for a detailed proof.\n\n5.3 Proofs for Stochastic Update\n\nRecall that the stochastic semigradient $g(t)$ is defined in (4.10). In parallel with Lemma 5.3, the\nfollowing lemma additionally characterizes the effect of the variance of $g(t)$, which is induced by\nthe randomness of the current tuple $(x, r, x')$. We use $\\mathbb{E}_W[\\cdot]$ to denote the expectation\nover the randomness of the current iterate $W(t)$ conditional on the random initialization $b$ and $W(0)$.\nCorrespondingly, $\\mathbb{E}_{W,\\mu}[\\cdot]$ is over the randomness of both the current tuple $(x, r, x')$ and the current\niterate $W(t)$ conditional on the random initialization.\nLemma 5.4 (Stochastic Descent Lemma). 
For $\\{W(t)\\}_{t \\in [T]}$ in Algorithm 1, it holds that\n\n$$\\mathbb{E}_{W,\\mu}[\\|W(t+1) - W^*\\|_2^2] \\le \\mathbb{E}_W[\\|W(t) - W^*\\|_2^2] - (2\\eta(1 - \\gamma) - 8\\eta^2) \\cdot \\mathbb{E}_{W,\\mu}[(\\widehat{Q}_0(x; W(t)) - \\widehat{Q}_0(x; W^*))^2] + \\underbrace{\\mathbb{E}_W[2\\eta^2 \\cdot \\|\\overline{g}(t) - \\overline{g}_0(t)\\|_2^2 + 2\\eta B \\cdot \\|\\overline{g}(t) - \\overline{g}_0(t)\\|_2]}_{\\text{Error of Local Linearization}} + \\underbrace{\\mathbb{E}_{W,\\mu}[\\eta^2 \\cdot \\|g(t) - \\overline{g}(t)\\|_2^2]}_{\\text{Variance of Semigradient}}.$$\n\nProof. See Appendix C.4 for a detailed proof.\n\nTo ensure the global convergence of neural TD in the presence of the variance of $g(t)$, we rescale\nthe stepsize to be of order $T^{-1/2}$. The rest of the proof of Theorem 4.6 mirrors that of Theorem 4.4. See\nAppendix C.6 for a detailed proof.\n\n6 Conclusions\n\nIn this paper we prove that neural TD converges at a sublinear rate to the global optimum of the\nMSPBE for policy evaluation. In particular, we show how such global convergence is enabled by the\noverparametrization of neural networks. Our results shed new light on the theoretical understanding\nof RL with neural networks, which is widely employed in practice.\n\nReferences\n[1] Allen-Zhu, Z., Li, Y. and Liang, Y. (2018). Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.\n\n[2] Amiranashvili, A., Dosovitskiy, A., Koltun, V. and Brox, T. (2018). TD or not TD: Analyzing the role of temporal differencing in deep reinforcement learning. arXiv preprint arXiv:1806.01175.\n\n[3] Arora, S., Du, S. S., Hu, W., Li, Z. and Wang, R. (2019). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.\n\n[4] Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning.\n\n[5] Bertsekas, D. P. (2019). 
Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6 1\u201331.\n\n[6] Bhandari, J., Russo, D. and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450.\n\n[7] Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R. and Szepesv\u00e1ri, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems.\n\n[8] Borkar, V. S. (2009). Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer.\n\n[9] Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 447\u2013469.\n\n[10] Boyan, J. A. (1999). Least-squares temporal difference learning. In International Conference on Machine Learning.\n\n[11] Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems.\n\n[12] Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 33\u201357.\n\n[13] Chizat, L. and Bach, F. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.\n\n[14] Dalal, G., Sz\u00f6r\u00e9nyi, B., Thoppe, G. and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In AAAI Conference on Artificial Intelligence.\n\n[15] Daniely, A. (2017). SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.\n\n[16] Dann, C., Neumann, G. and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. 
Journal of Machine Learning Research, 15 809\u2013883.\n\n[17] Du, S. S., Chen, J., Li, L., Xiao, L. and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning.\n\n[18] Duan, Y., Chen, X., Houthooft, R., Schulman, J. and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.\n\n[19] Facchinei, F. and Pang, J.-S. (2007). Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.\n\n[20] Fan, J., Ma, C. and Zhong, Y. (2019). A selective overview of deep learning. arXiv preprint arXiv:1904.05526.\n\n[21] Geist, M. and Pietquin, O. (2013). Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24 845\u2013867.\n\n[22] Ghavamzadeh, M., Lazaric, A., Maillard, O. and Munos, R. (2010). LSTD with random projections. In Advances in Neural Information Processing Systems.\n\n[23] Haarnoja, T., Tang, H., Abbeel, P. and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.\n\n[24] Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.\n\n[25] Harker, P. T. and Pang, J.-S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 48 161\u2013220.\n\n[26] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. and Meger, D. (2018). Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence.\n\n[27] Hofmann, T., Sch\u00f6lkopf, B. and Smola, A. J. (2008). 
Kernel methods in machine learning. Annals of Statistics 1171\u20131220.\n\n[28] Jaakkola, T., Jordan, M. I. and Singh, S. P. (1994). Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems.\n\n[29] Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.\n\n[30] Jain, P. and Kar, P. (2017). Non-convex optimization for machine learning. Foundations and Trends\u00ae in Machine Learning, 10 142\u2013336.\n\n[31] Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.\n\n[32] Kushner, H. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media.\n\n[33] Lakshminarayanan, C. and Szepesv\u00e1ri, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics.\n\n[34] Lazaric, A., Ghavamzadeh, M. and Munos, R. (2010). Finite-sample analysis of LSTD. In International Conference on Machine Learning.\n\n[35] Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.\n\n[36] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.\n\n[37] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.\n\n[38] Melo, F. S., Meyn, S. P. and Ribeiro, M. I. (2008). 
An analysis of reinforcement learning with function approximation. In International Conference on Machine Learning.\n\n[39] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.\n\n[40] Neu, G., Jonsson, A. and G\u00f3mez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.\n\n[41] Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y. and Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.\n\n[42] Pfau, D. and Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945.\n\n[43] Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.\n\n[44] Rahimi, A. and Recht, B. (2008). Uniform approximation of functions with random bases. In Annual Allerton Conference on Communication, Control, and Computing.\n\n[45] Scherrer, B. (2010). Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In International Conference on Machine Learning.\n\n[46] Schulman, J., Chen, X. and Abbeel, P. (2017). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.\n\n[47] Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.\n\n[48] Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923.\n\n[49] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. 
Machine Learning, 3 9\u201344.\n\n[50] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.\n\n[51] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesv\u00e1ri, C. and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning.\n\n[52] Sutton, R. S., Maei, H. R. and Szepesv\u00e1ri, C. (2009). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.\n\n[53] Szepesv\u00e1ri, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4 1\u2013103.\n\n[54] Touati, A., Bacon, P.-L., Precup, D. and Vincent, P. (2017). Convergent tree-backup and retrace with function approximation. arXiv preprint arXiv:1705.09322.\n\n[55] Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.\n\n[56] Tu, S. and Recht, B. (2017). Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642.\n\n[57] Wang, Y., Chen, W., Liu, Y., Ma, Z.-M. and Liu, T.-Y. (2017). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems.\n\n[58] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229\u2013256.\n\n[59] Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.\n\n[60] Zou, S., Xu, T. and Liang, Y. (2019). Finite-sample analysis for SARSA and Q-learning with linear function approximation. 
arXiv preprint arXiv:1902.02234.\n", "award": [], "sourceid": 6047, "authors": [{"given_name": "Qi", "family_name": "Cai", "institution": "Northwestern University"}, {"given_name": "Zhuoran", "family_name": "Yang", "institution": "Princeton University"}, {"given_name": "Jason", "family_name": "Lee", "institution": "Princeton University"}, {"given_name": "Zhaoran", "family_name": "Wang", "institution": "Northwestern University"}]}