{"title": "Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 10634, "page_last": 10644, "abstract": "Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independently and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d.\\ Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t/t^(2/3)) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.", "full_text": "Two Time-scale Off-Policy TD Learning:\n\nNon-asymptotic Analysis over Markovian Samples\n\nTengyu Xu\n\nDepartment of Electrical and Computer Engineering\n\nThe Ohio State University\n\nxu.3260@osu.edu\n\nShaofeng Zou\n\nDepartment of Electrical Engineering\n\nUniversity at Buffalo, The State University of New York\n\nszou3@buffalo.edu\n\nYingbin Liang\n\nDepartment of Electrical and Computer Engineering\n\nThe Ohio State University\n\nliang.889@osu.edu\n\nAbstract\n\nGradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. 
Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independently and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t/t^(2/3)) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.\n\n1 Introduction\n\nIn practice, it is very common that we wish to learn the value function of a target policy based on data sampled by a different behavior policy, in order to make maximum use of the data available. For such off-policy scenarios, it has been shown that conventional temporal difference (TD) algorithms [24, 25] and Q-learning [33] may diverge to infinity when using linear function approximation [2]. To overcome the divergence issue in off-policy TD learning, [27, 26, 17] proposed a family of gradient-based TD (GTD) algorithms, which were shown to have guaranteed convergence in off-policy settings and to be more flexible than on-policy learning in practice [18, 23]. Among those GTD algorithms, the TD with gradient correction (TDC) algorithm has been verified to have superior performance [17, 9] and is widely used in practice. 
To elaborate, TDC uses the mean squared projected Bellman error as the objective function, and iteratively updates the function approximation parameter with the assistance of an auxiliary parameter that is also iteratively updated.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThese two parameters are typically updated with stepsizes diminishing at different rates, resulting in the two time-scale implementation of TDC, i.e., the function approximation parameter is updated at a slower time-scale and the auxiliary parameter is updated at a faster time-scale.\n\nThe convergence of two time-scale TDC and general two time-scale stochastic approximation (SA) has been well studied. The asymptotic convergence has been shown in [4, 6] for two time-scale SA, and in [26] for two time-scale TDC, where both studies assume that the data are sampled in an independently and identically distributed (i.i.d.) manner. Under non-i.i.d. observed samples, the asymptotic convergence of general two time-scale SA and TDC was established in [14, 36]. None of the above studies characterized how fast the two time-scale algorithms converge, i.e., they did not establish the non-asymptotic convergence rate, which is especially important for a two time-scale algorithm. In order for two time-scale TDC to perform well, it is important to properly choose the relative scaling rate of the stepsizes for the two time-scale iterations. In practice, this can be done by fixing one stepsize and treating the other stepsize as a tuning hyper-parameter [9], which is very costly. The non-asymptotic convergence rate by nature captures how the scaling of the two stepsizes affects the performance and hence can serve as a guide for choosing the two time-scale stepsizes in practice. Recently, [8] established the non-asymptotic convergence rate for the projected two time-scale TDC with i.i.d. 
samples under diminishing stepsize.\n\n• One important open problem that still needs to be addressed is to characterize the non-asymptotic convergence rate for two time-scale TDC under non-i.i.d. samples and diminishing stepsizes, and to explore what such a result suggests for designing the stepsizes of the fast and slow time-scales accordingly. The existing method developed in [8], which handles the non-asymptotic analysis for i.i.d. sampled TDC, does not admit a direct extension to the non-i.i.d. setting. Thus, new technical developments are necessary to solve this problem.\n\nFurthermore, although diminishing stepsize offers accurate convergence, constant stepsize is often preferred in practice due to its much faster error decay (i.e., convergence) rate. For example, empirical results have shown that for one time-scale conventional TD, constant stepsize not only yields fast convergence, but also results in convergence accuracy comparable to diminishing stepsize [9]. However, for two time-scale TDC, our experiments (see Section 4.2) demonstrate that constant stepsize, although it yields faster convergence, incurs a much larger convergence error than diminishing stepsize. This motivates us to address the following two open issues.\n\n• It is important to theoretically understand/explain why constant stepsize yields large convergence error for two time-scale TDC. 
The existing non-asymptotic analysis for two time-scale TDC [8] focused only on diminishing stepsize, and did not characterize the convergence rate of two time-scale TDC under constant stepsize.\n\n• For two time-scale TDC, given the fact that constant stepsize yields large convergence error but converges fast, whereas diminishing stepsize has small convergence error but converges slowly, it is desirable to design a new update scheme for TDC that converges faster than diminishing stepsize, but has as good convergence error as diminishing stepsize.\n\nIn this paper, we comprehensively address the above issues.\n\n1.1 Our Contribution\n\nOur main contributions are summarized as follows.\n\nWe develop a novel non-asymptotic analysis for two time-scale TDC with a single sample path and under non-i.i.d. data. We show that under the diminishing stepsizes α_t = c_α/(1 + t)^σ and β_t = c_β/(1 + t)^ν respectively for the slow and fast time-scales (where c_α, c_β, ν, σ are positive constants and 0 < ν < σ ≤ 1), the convergence rate can be as fast as O(log t/t^{2/3}), which is achieved by σ = (3/2)ν = 1. This recovers the convergence rate (up to a log t factor due to non-i.i.d. data) in [8] for i.i.d. data as a special case.\n\nWe also develop the non-asymptotic analysis for TDC under non-i.i.d. data and constant stepsize. In contrast to conventional one time-scale analysis, our result shows that the training error (at the slow time-scale) and the tracking error (at the fast time-scale) converge at different rates (due to different condition numbers), though both converge linearly to a neighborhood of the solution. Our result also characterizes the impact of the tracking error on the training error. 
Our result suggests that TDC under constant stepsize can converge faster than that under diminishing stepsize at the cost of a large training error, due to a large tracking error caused by the auxiliary parameter iteration in TDC.\n\nWe take a further step and propose a TDC algorithm under a blockwise diminishing stepsize, inspired by [35] in conventional optimization, in which both stepsizes are constant over a block, and decay across blocks. We show that TDC asymptotically converges with an arbitrarily small training error at a blockwisely linear convergence rate as long as the block length and the decay of stepsizes across blocks are chosen properly. Our experiments demonstrate that TDC under a blockwise diminishing stepsize converges as fast as vanilla TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.\n\nFrom the technical standpoint, our proof develops new tools to handle the non-asymptotic analysis of the bias due to non-i.i.d. data for two time-scale algorithms under diminishing stepsize that does not require square summability, to bound the impact of the fast-time-scale tracking error on the slow-time-scale training error, and to recursively refine the error bound in order to sharpen the convergence rate.\n\n1.2 Related Work\n\nDue to extensive studies on TD learning, we here include only the work most relevant to this paper.\n\nOn-policy TD and SA. The convergence of TD learning with linear function approximation under i.i.d. samples has been well established by using standard results in SA [5]. The non-asymptotic convergence has been established in [4, 12, 30] for general SA algorithms with martingale difference noise, and in [7] for TD with i.i.d. samples. 
For the Markovian setting, the asymptotic convergence has been established in [31, 28] for TD(λ), and the non-asymptotic convergence has been provided for projected TD(λ) in [3] and for linear SA with Markovian noise in [13, 22, 21]. For linear SA with dynamic Markovian noise, the non-asymptotic analysis of on-policy SARSA under non-i.i.d. samples was recently studied in [37].\n\nOff-policy one time-scale GTD. The convergence of one time-scale GTD and GTD2 (which are off-policy TD algorithms) was derived by applying standard results in SA [27, 26, 17]. The non-asymptotic analysis for GTD and GTD2 has been conducted in [16] by converting the objective function into a convex-concave saddle-point problem, and was further generalized to the Markovian setting in [32]. However, such an approach cannot be generalized for analyzing the two time-scale TDC that we study here, because TDC does not have an explicit saddle-point representation.\n\nOff-policy two time-scale TDC and SA. The asymptotic convergence of two time-scale TDC under i.i.d. samples has been established in [26, 17], and the non-asymptotic analysis has been provided in [8] as a special case of two time-scale linear SA. Under the Markovian setting, the convergence of various two time-scale GTD algorithms has been studied in [36]. The non-asymptotic analysis of two time-scale TDC under non-i.i.d. data has not been studied before, which is the focus of this paper.\n\nGeneral two time-scale SA has also been studied. The convergence of two time-scale SA with martingale difference noise was established in [4], and its non-asymptotic convergence was provided in [15, 20, 8, 6]. Some of these results can be applied to two time-scale TDC under i.i.d. samples (which can fit into a special case of SA with martingale difference noise), but not to the non-i.i.d. setting. For two time-scale linear SA with more general Markovian noise, only asymptotic convergence was established in [29, 34, 14]. 
In fact, our non-asymptotic analysis for two time-scale TDC can be of independent interest and be further generalized for studying linear SA with more general Markovian noise.\n\nTwo concurrent and independent studies were posted online recently, which are related to our study. [10] provided a non-asymptotic analysis for two time-scale linear SA under the non-i.i.d. setting, in which both variables are updated with constant stepsize. In contrast, our study provides the convergence rate for the case with the two variables being updated by stepsizes that diminish at different rates, and hence our analysis technique is very different from that in [10]. Another study [11] proposed an interesting approach to analyze the convergence rate of TD learning in the Markovian setting via a Markov jump linear system. Such an approach, however, cannot be applied directly to study the two time-scale TD algorithm that we study here.\n\n2 Problem Formulation\n\n2.1 Off-policy Value Function Evaluation\n\nWe consider the problem of policy evaluation for a Markov decision process (MDP) (S, A, P, r, γ), where S ⊂ R^d is a compact state space, A is a finite action set, P = P(s′|s, a) is the transition kernel, r(s, a, s′) is the reward function bounded by r_max, and γ ∈ (0, 1) is the discount factor. A stationary policy π maps a state s ∈ S to a probability distribution π(·|s) over A. At time-step t, suppose the process is in some state s_t ∈ S. Then an action a_t ∈ A is taken based on the distribution π(·|s_t), the system transitions to a next state s_{t+1} ∈ S governed by the transition kernel P(·|s_t, a_t), and a reward r_t = r(s_t, a_t, s_{t+1}) is received. Assuming the associated Markov chain p(s′|s) = Σ_{a∈A} p(s′|s, a)π(a|s) is ergodic, let μ_π be the induced stationary distribution of this MDP, i.e., Σ_s p(s′|s)μ_π(s) = μ_π(s′). The value function for policy π is defined as v_π(s) = E[Σ_{t=0}^∞ γ^t r(s_t, a_t, s_{t+1}) | s_0 = s, π], and it is known that v_π(s) is the unique fixed point of the Bellman operator T^π, i.e., v_π(s) = T^π v_π(s) := r_π(s) + γ E_{s′|s} v_π(s′), where r_π(s) = E_{a,s′|s} r(s, a, s′) is the expected reward of the Markov chain induced by policy π.\n\nWe consider the policy evaluation problem in the off-policy setting. Namely, a sample path {(s_t, a_t, s_{t+1})}_{t≥0} is generated by the Markov chain according to the behavior policy π_b, but our goal is to obtain the value function of a target policy π, which is different from π_b.\n\n2.2 Two Time-Scale TDC\n\nWhen S is large or infinite, a linear function v̂(s, θ) = φ(s)^⊤θ is often used to approximate the value function, where φ(s) ∈ R^d is a fixed feature vector for state s and θ ∈ R^d is a parameter vector. We can also write the linear approximation in the vector form as v̂(θ) = Φθ, where Φ is the |S| × d feature matrix. 
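As a self-contained illustration of the fixed-point property above (a toy 2-state chain of our own choosing, not an example from the paper), one can iterate the Bellman operator T^π and check that it converges to the same v_π obtained by solving (I − γP_π)v = r_π directly:

```python
# Toy illustration (hypothetical 2-state chain): v_pi is the unique fixed
# point of the Bellman operator (T^pi v)(s) = r_pi(s) + gamma * E[v(s')].
gamma = 0.9
P = [[0.7, 0.3],   # P[s][s']: transition kernel induced by the policy
     [0.4, 0.6]]
r = [1.0, 0.0]     # expected one-step reward r_pi(s)

def bellman(v):
    """Apply T^pi once."""
    return [r[s] + gamma * sum(P[s][sp] * v[sp] for sp in range(2)) for s in range(2)]

# Fixed-point iteration: T^pi is a gamma-contraction, so iterates converge to v_pi.
v = [0.0, 0.0]
for _ in range(2000):
    v = bellman(v)

# Closed form for the 2-state case: solve (I - gamma * P) v = r by Cramer's rule.
a, b_ = 1 - gamma * P[0][0], -gamma * P[0][1]
c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
det = a * d - b_ * c
v_exact = [(r[0] * d - b_ * r[1]) / det, (a * r[1] - c * r[0]) / det]

print(v, v_exact)
```

The two computations agree, confirming that the fixed point of T^π coincides with the linear-system solution.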
To find a parameter θ* ∈ R^d with E_{μ_{π_b}} v̂(s, θ*) = E_{μ_{π_b}} T^π v̂(s, θ*), the gradient-based TD algorithm TDC [26] updates the parameter by minimizing the mean-square projected Bellman error (MSPBE) objective, defined as\n\nJ(θ) = E_{μ_{π_b}}[v̂(s, θ) − ΠT^π v̂(s, θ)]^2,\n\nwhere Π = Φ(Φ^⊤ΞΦ)^{−1}Φ^⊤Ξ is the orthogonal projection onto the function space V̂ = {v̂(θ) | θ ∈ R^d and v̂(·, θ) = φ(·)^⊤θ}, and Ξ denotes the |S| × |S| diagonal matrix with the components of μ_{π_b} as its diagonal entries. Then, we define the matrices A, B, C and the vector b as\n\nA := E_{μ_{π_b}}[ρ(s, a)φ(s)(γφ(s′) − φ(s))^⊤], B := −γE_{μ_{π_b}}[ρ(s, a)φ(s′)φ(s)^⊤],\nC := −E_{μ_{π_b}}[φ(s)φ(s)^⊤], b := E_{μ_{π_b}}[ρ(s, a)r(s, a, s′)φ(s)],\n\nwhere ρ(s, a) = π(a|s)/π_b(a|s) is the importance weighting factor with ρ_max being its maximum value. If A and C are both non-singular, J(θ) is strongly convex and has θ* = −A^{−1}b as its global minimum, i.e., J(θ*) = 0. 
Motivated by minimizing the MSPBE objective function using stochastic gradient methods, TDC was proposed with the following update rules:\n\nθ_{t+1} = Π_{R_θ}(θ_t + α_t(A_t θ_t + b_t + B_t w_t)),   (1)\nw_{t+1} = Π_{R_w}(w_t + β_t(A_t θ_t + b_t + C_t w_t)),   (2)\n\nwhere A_t = ρ(s_t, a_t)φ(s_t)(γφ(s_{t+1}) − φ(s_t))^⊤, B_t = −γρ(s_t, a_t)φ(s_{t+1})φ(s_t)^⊤, C_t = −φ(s_t)φ(s_t)^⊤, b_t = ρ(s_t, a_t)r(s_t, a_t, s_{t+1})φ(s_t), and Π_R(x) = argmin_{x′: ‖x′‖_2 ≤ R} ‖x − x′‖_2 is the projection operator onto a norm ball of radius R < ∞. The projection step is widely used in the stochastic approximation literature. As we will show later, iterations (1)-(2) are guaranteed to converge to the optimal parameter θ* if we choose the values of R_θ and R_w appropriately. TDC with the update rules (1)-(2) is a two time-scale algorithm. The parameter θ iterates at a slow time-scale determined by the stepsize {α_t}, whereas w iterates at a fast time-scale determined by the stepsize {β_t}. Throughout the paper, we make the following standard assumptions [3, 32, 17].\n\nAssumption 1 (Problem solvability). The matrices A and C are non-singular.\n\nAssumption 2 (Bounded feature). ‖φ(s)‖_2 ≤ 1 for all s ∈ S and ρ_max < ∞.\n\nAssumption 3 (Geometric ergodicity). There exist constants m > 0 and ρ ∈ (0, 1) such that\n\nsup_{s∈S} d_TV(P(s_t ∈ ·|s_0 = s), μ_{π_b}) ≤ mρ^t, ∀t ≥ 0,\n\nwhere d_TV(P, Q) denotes the total-variation distance between the probability measures P and Q.\n\nIn Assumption 1, the matrix A is required to be non-singular so that the optimal parameter θ* = −A^{−1}b is well defined. The matrix C is non-singular when the feature matrix Φ has linearly independent columns. 
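A single iteration of (1)-(2) can be sketched as follows. This is a minimal illustration with made-up features and a hypothetical transition, not the paper's experimental setup; it uses the identity A_t θ + b_t = ρ δ φ(s), where δ is the importance-weighted TD error, to avoid forming d × d matrices:

```python
import numpy as np

def project(x, radius):
    # Pi_R(x): Euclidean projection onto the norm ball of radius R.
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def tdc_step(theta, w, phi_s, phi_sp, reward, rho, gamma, alpha, beta, R_theta, R_w):
    """One TDC update (1)-(2) from a single transition (s, a, s').

    Uses A_t = rho*phi(s)(gamma*phi(s') - phi(s))^T, B_t = -gamma*rho*phi(s')phi(s)^T,
    C_t = -phi(s)phi(s)^T, b_t = rho*r*phi(s), so A_t theta + b_t = rho*delta*phi(s).
    """
    delta = reward + gamma * (phi_sp @ theta) - phi_s @ theta  # TD error
    avg = rho * delta * phi_s                                  # A_t theta + b_t
    theta_new = project(theta + alpha * (avg - gamma * rho * (phi_s @ w) * phi_sp), R_theta)
    w_new = project(w + beta * (avg - (phi_s @ w) * phi_s), R_w)
    return theta_new, w_new

rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
phi_s, phi_sp = rng.normal(size=d), rng.normal(size=d)
phi_s /= np.linalg.norm(phi_s)    # Assumption 2: ||phi(s)||_2 <= 1
phi_sp /= np.linalg.norm(phi_sp)
theta, w = tdc_step(theta, w, phi_s, phi_sp, 1.0, 0.8, 0.95, 0.01, 0.02, 10.0, 10.0)
print(theta, w)
```

Starting from θ = w = 0, both updates move along ρδφ(s), so after one step w equals (β/α)·θ, which makes the two time-scales of the shared gradient term easy to see.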
Assumption 2 can be ensured by normalizing the basis functions {φ_i}_{i=1}^d and holds when π_b(·|s) is non-degenerate for all s. Assumption 3 holds for any time-homogeneous Markov chain with finite state-space and any uniformly ergodic Markov chain with general state space.\n\nThroughout the paper, we require R_θ ≥ ‖A^{−1}‖_2‖b‖_2 and R_w ≥ 2‖C^{−1}‖_2‖A‖_2 R_θ. In practice, we can estimate A, C and b as mentioned in [3] or simply let R_θ and R_w be large enough.\n\n3 Main Theorems\n\n3.1 Non-asymptotic Analysis under Diminishing Stepsize\n\nOur first main result is the convergence rate of two time-scale TDC with diminishing stepsize. We define the tracking error z_t = w_t − ψ(θ_t), where ψ(θ_t) = −C^{−1}(b + Aθ_t) is the stationary point of the ODE given by ẇ(t) = Cw(t) + Aθ_t + b, with θ_t being fixed. Let λ_θ and λ_w be any constants that satisfy λ_max(2A^⊤C^{−1}A) ≤ λ_θ < 0 and λ_max(2C) ≤ λ_w < 0.\n\nTheorem 1. Consider the projected two time-scale TDC algorithm in (1)-(2). Suppose Assumptions 1-3 hold. Suppose we apply the diminishing stepsizes α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν, which satisfy 0 < ν < σ < 1, 0 < c_α < 1/|λ_θ| and 0 < c_β < 1/|λ_w|. Suppose ε and ε′ can be any constants in (0, σ − ν] and (0, 0.5], respectively. 
Then we have for t ≥ 0:\n\nE‖θ_t − θ*‖_2^2 ≤ O(e^{−(|λ_θ|c_α/(1−σ))(t^{1−σ}−1)}) + O(log t/t^σ) + O((log t/t^ν + h(σ, ν))^{1−ε′}),   (3)\nE‖z_t‖_2^2 ≤ O(log t/t^ν) + O(h(σ, ν)),   (4)\n\nwhere\n\nh(σ, ν) = { 1/t^ν, if σ > 1.5ν;  1/t^{2(σ−ν)−ε}, if ν < σ ≤ 1.5ν.   (5)\n\nIf 0 < ν < σ = 1, with c_α = 1/|λ_θ| and 0 < c_β < 1/|λ_w|, we have for t ≥ 0\n\nE‖θ_t − θ*‖_2^2 ≤ O((log t)^2/t) + O((log t/t^ν + h(1, ν))^{1−ε′}).   (6)\n\nFor explicit expressions of (3), (4) and (6), please refer to (25), (18) and (28) in the Appendix.\n\nWe further explain Theorem 1 as follows: (a) In (3) and (5), since both ε and ε′ can be arbitrarily small, the convergence of E‖θ_t − θ*‖_2^2 can be almost as fast as 1/t^{2(σ−ν)} when ν < σ < 1.5ν, and 1/t^ν when 1.5ν ≤ σ. The best convergence rate is then almost as fast as O(log t/t^{2/3}), achieved with σ = (3/2)ν = 1. (b) If the data are i.i.d. generated, then our bound reduces to E‖θ_t − θ*‖_2^2 ≤ O(exp(λ_θ c_α(t^{1−σ} − 1)/(1 − σ))) + O(1/t^σ) + O(h(σ, ν))^{1−ε′} with h(σ, ν) = 1/t^ν when σ > 1.5ν, and h(σ, ν) = 1/t^{2(σ−ν)−ε} when ν < σ ≤ 1.5ν. The best convergence rate is then almost as fast as 1/t^{2/3} with σ = (3/2)ν = 1, as given in [8].\n\nTheorem 1 characterizes the relationship between the convergence rate of θ_t and the stepsizes α_t and β_t. The first term of the bound in (3) corresponds to the convergence rate of θ_t with the full gradient ∇J(θ_t), which decays exponentially with t. The second term is introduced by the bias and variance of the gradient estimator, which decays sublinearly with t. The last term arises due to the accumulated tracking error z_t, which specifically arises in two time-scale algorithms, and captures how accurately w_t tracks ψ(θ_t). Thus, if w_t tracked the stationary point ψ(θ_t) perfectly in each step, then we would have only the first two terms in (3), which matches the results for one time-scale TD learning [3, 7]. Theorem 1 indicates that asymptotically, (3) is dominated by the tracking error term O(h(σ, ν)^{1−ε′}), which depends on the diminishing rates of α_t and β_t. Since both ε and ε′ can be arbitrarily small, if the diminishing rate of α_t is close to that of β_t, then the tracking error is dominated by the slow drift, which has an approximate order O(1/t^{2(σ−ν)}); if the diminishing rate of α_t is much faster than that of β_t, then the tracking error is dominated by the accumulated bias, which has an approximate order O(log t/t^ν). Moreover, (5) and (6) suggest that for any fixed σ ∈ (0, 1], the optimal diminishing rate of β_t is achieved by σ = (3/2)ν.\n\nFrom the technical standpoint, we develop novel techniques to handle the interaction between the training error and the tracking error and to sharpen the error bounds recursively. 
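The tracking target ψ(θ) = −C^{−1}(b + Aθ) used throughout the analysis is exactly the point where the fast-time-scale drift Cw + Aθ + b vanishes. This can be checked numerically with hypothetical A, C, b (random placeholders, not matrices estimated from any MDP):

```python
import numpy as np

# Numeric sanity check: psi(theta) = -C^{-1}(b + A theta) zeroes the drift
# of the fast ODE w' = C w + A theta + b, with theta held fixed.
rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d))                       # placeholder "A" matrix
C = -np.eye(d) - 0.1 * rng.normal(size=(d, d))    # invertible, roughly -I
b = rng.normal(size=d)
theta = rng.normal(size=d)

psi = -np.linalg.solve(C, b + A @ theta)          # tracking target psi(theta)
drift = C @ psi + A @ theta + b                   # should be the zero vector
print(drift)
```

The drift evaluates to (numerically) zero, which is why z_t = w_t − ψ(θ_t) is the natural error coordinate for the fast time-scale.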
The proof sketch and the detailed steps are provided in Appendix A.\n\n3.2 Non-asymptotic Analysis under Constant Stepsize\n\nAs we remark in Section 1, it has been demonstrated by empirical results [9] that standard TD under constant stepsize not only converges fast, but also has comparable training error as that under diminishing stepsize. However, this does not hold for TDC. When the two variables in TDC are both updated under constant stepsize, our experiments demonstrate that constant stepsize yields fast convergence, but a large training error. In this subsection, we aim to explain why this happens by analyzing the convergence rates of the two variables in TDC, and the impact of one on the other.\n\nThe following theorem provides the convergence result for TDC with the two variables iteratively updated respectively by two different constant stepsizes.\n\nTheorem 2. Consider the projected TDC algorithm in eqs. (1) and (2). Suppose Assumptions 1-3 hold. Suppose we apply constant stepsizes α_t = α and β_t = β with α = ηβ, which satisfy η > 0, 0 < α < 1/|λ_θ| and 0 < β < 1/|λ_w|. We then have for t ≥ 0:\n\nE‖θ_t − θ*‖_2^2 ≤ (1 − |λ_θ|α)^t (‖θ_0 − θ*‖_2^2 + C1) + C2 max{α, α ln(1/α)} + (C3 max{β, β ln(1/β)} + C4 η)^{0.5},   (7)\nE‖z_t‖_2^2 ≤ (1 − |λ_w|β)^t ‖z_0‖_2^2 + C5 max{β, β ln(1/β)} + C6 η,   (8)\n\nwhere C1 = 4γρ_max R_θ R_w (1 − (1 − |λ_θ|α)^{T+1})/(|λ_θ|(1 − |λ_θ|α)^{T+1}) with T = ⌈ln[C5 max{β, ln(1/β)β}/‖z_0‖_2^2]/(−ln(1 − |λ_w|β))⌉, and C2, C3, C4, 
For explicit expressions of C2, C3, C4, C5\nand C6, please refer to (67), (68), (69), (59), and (60) in the Supplementary Materials.\nTheorem 2 shows that TDC with constant stepsize converges to a neighborhood of \u03b8\u2217 exponentially\nfast. The size of the neighborhood depends on the second and the third terms of the bound in (7),\nwhich arise from the bias and variance of the update of \u03b8t and the tracking error zt in (8), respectively.\nClearly, the convergence zt, although is also exponentially fast to a neighborhood, is under a different\nrate due to the different condition number. We further note that as the stepsize parameters \u03b1, \u03b2\napproach 0 in a way such that \u03b1/\u03b2 \u2192 0, \u03b8t approaches to \u03b8\u2217 as t \u2192 \u221e, which matches the asymptotic\nconvergence result for two time-scale TDC under constant stepsize in [36].\nDiminishing vs Constant Stepsize: We next discuss the comparison between TDC under diminish-\ning stepsize and constant stepsize. Generally, Theorem 1 suggests that diminishing stepsize yields\nbetter converge guarantee (i.e., converges exactly to \u03b8\u2217) than constant stepsize shown in Theorem 2\n(i.e., converges to the neighborhood of \u03b8\u2217). In practice, constant stepsize is recommended because\ndiminishing stepsize may take much longer time to converge. However, as Figure 2 in Section 4.2\nshows, although TDC with large constant stepsize converges fast, the training error due to the conver-\ngence to the neighborhood is signi\ufb01cantly worse than the diminishing stepsize. More speci\ufb01cally,\nwhen \u03b7 = \u03b1/\u03b2 is \ufb01xed, as \u03b1 grows, the convergence becomes faster, but as a consequence, the\nterm (C3 max{\u03b2, \u03b2 ln 1\n\u03b2} + C4\u03b7)0.5 due to the tracking error increases and results in a large training\nerror. 
Alternatively, if α is made small enough that the training error is comparable to that under diminishing stepsize, then the convergence becomes very slow. This suggests that simply setting the stepsize to be constant for TDC does not yield the desired performance. This motivates us to design an update scheme for TDC such that it can enjoy an error convergence rate as fast as constant stepsize offers, while still having accuracy comparable to diminishing stepsize.\n\n3.3 TDC under Blockwise Diminishing Stepsize\n\nIn this subsection, we propose a blockwise diminishing stepsize scheme for TDC (see Algorithm 1), and study its theoretical convergence guarantee. In Algorithm 1, we define t_s = Σ_{i=0}^{s} T_i.\n\nAlgorithm 1 Blockwise Diminishing Stepsize TDC\nInput: θ_{0,0} = θ_0, w_{0,0} = w_0 = 0, T_0 = 0, block index S\n1: for s = 1, 2, ..., S do\n2:   θ_{s,0} = θ_{s−1}, w_{s,0} = w_{s−1}\n3:   for i = 1, 2, ..., T_s do\n4:     Sample (s_{t_{s−1}+i}, a_{t_{s−1}+i}, s_{t_{s−1}+i+1}, r_{t_{s−1}+i}) from the trajectory\n5:     θ_{s,i} = Π_{R_θ}(θ_{s,i−1} + α_s(A_{t_{s−1}+i} θ_{s,i−1} + b_{t_{s−1}+i} + B_{t_{s−1}+i} w_{s,i−1}))\n6:     w_{s,i} = Π_{R_w}(w_{s,i−1} + β_s(A_{t_{s−1}+i} θ_{s,i−1} + b_{t_{s−1}+i} + C_{t_{s−1}+i} w_{s,i−1}))\n7:   end for\n8:   θ_s = θ_{s,T_s}, w_s = w_{s,T_s}\n9: end for\nOutput: θ_S, w_S\n\nThe idea of Algorithm 1 is to divide the iteration process into blocks, and diminish the stepsize blockwisely while keeping the stepsize constant within each block. In this way, within each block, TDC can decay fast due to the constant stepsize and still achieve an accurate solution due to the blockwise decay of the stepsize, as we will demonstrate in Section 4. 
More specifically, the constant stepsizes α_s and β_s for block s are chosen to decay geometrically, such that the tracking error and the accumulated variance and bias are asymptotically small; and the block length T_s increases geometrically across blocks, such that the training error E‖θ_s − θ*‖_2^2 decreases geometrically blockwise. We note that the design of the algorithm is inspired by the method proposed in [35] for conventional optimization problems.\n\nThe following theorem characterizes the convergence of Algorithm 1.\n\nTheorem 3. Consider the projected TDC algorithm with blockwise diminishing stepsize as in Algorithm 1. Suppose Assumptions 1-3 hold. Suppose max{log(1/α_s)α_s, α_s} ≤ min{ε_{s−1}/(4C7), 1/|λ_x|}, β_s = ηα_s and T_s = ⌈log_{1/(1−|λ_x|α_s)} 4⌉, where λ_x < 0 and C7 > 0 are constants independent of s (see (72) and (75) in the Supplementary Materials for explicit expressions of λ_x and C7), ε_s = ‖θ_0 − θ*‖_2/2^s and η ≥ (1/2)max{0, λ_min(C^{−1}(A^⊤ + A))}. Then, after S = ⌈log_2(ε_0/ε)⌉ blocks, we have\n\nE‖θ_S − θ*‖_2^2 ≤ ε.\n\nThe total sample complexity is O((1/ε) log(1/ε)).\n\nTheorem 3 indicates that the sample complexity of TDC under blockwise diminishing stepsize is slightly better than that under diminishing stepsize. Our empirical results (see Section 4.3) also demonstrate that blockwise diminishing stepsize yields convergence as fast as constant stepsize and training error comparable to diminishing stepsize. 
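The stepsize/block-length schedule of Theorem 3 can be sketched numerically. The constants below (λ_x, the initial α) are placeholders of our own choosing, not the paper's values; the point is only that when α_s halves each block, T_s = ⌈log_{1/(1−|λ_x|α_s)} 4⌉ roughly doubles, so each block runs long enough to shrink the block-start error by about a factor of 4:

```python
import math

# Sketch of Algorithm 1's schedule with made-up constants: alpha_s decays
# geometrically across blocks while T_s = ceil(log_{1/(1-|lambda_x| alpha_s)} 4)
# grows geometrically.
lam_x = 0.5     # placeholder for |lambda_x|
alpha = 0.1     # placeholder initial stepsize alpha_1
schedule = []
for s in range(1, 7):
    T_s = math.ceil(math.log(4) / -math.log(1 - lam_x * alpha))
    schedule.append((s, alpha, T_s))
    alpha /= 2  # constant within a block, halved across blocks

for s, a, T in schedule:
    print(f"block {s}: alpha_s={a:.5f}, T_s={T}")
```

Since T_s ≈ ln 4/(|λ_x|α_s) for small α_s, halving α_s doubles T_s, which is the geometric block-length growth used in the sample-complexity count of Theorem 3.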
However, we want to point out that the advantage of blockwise diminishing stepsize does not come for free, but rather at the cost of some extra parameter tuning in practice to estimate ε_0, |λ_x|, C7 and η; whereas the diminishing stepsize scheme, as guided by our Theorem 1, requires tuning at most three parameters to obtain desirable performance.\n\n4 Experimental Results\n\nIn this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1. More precisely, we consider Garnet problems [1], denoted as G(n_S, n_A, p, q), where n_S denotes the number of states, n_A denotes the number of actions, p denotes the number of possible next states for each state-action pair, and q denotes the number of features. The reward is state-dependent, and both the reward and the feature vectors are generated randomly. The discount factor γ is set to 0.95 in all experiments. We consider the G(500, 20, 50, 20) problem. For all experiments, we choose θ_0 = w_0 = 0. All plots report the evolution of the mean square error over 500 independent runs.\n\n4.1 Optimal Diminishing Stepsize\n\nIn this subsection, we provide numerical results to verify Theorem 1. We compare the performance of TDC updates with the same α_t but different β_t. We consider four different diminishing stepsize settings: (1) c_α = c_β = 0.03, σ = 0.15; (2) c_α = c_β = 0.18, σ = 0.30; (3) c_α = c_β = 1, σ = 0.45; (4) c_α = c_β = 4, σ = 0.60. For each case with fixed slow time-scale parameter σ, the fast time-scale stepsize β_t has decay rate ν set to (1/3)σ, (1/2)σ, (5/9)σ, (2/3)σ, (5/6)σ, and σ. Our results are reported in Figure 1, in which for each case the left figure reports the overall iteration process and the right figure reports the corresponding zoomed tail process of the last 100000 iterations. 
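A Garnet-style generator G(n_S, n_A, p, q) matching the description above can be sketched as follows. The specific distribution choices (Dirichlet branching weights, uniform rewards, normalized Gaussian features) are our assumptions for illustration, not necessarily those of [1] or the paper:

```python
import numpy as np

def make_garnet(n_s, n_a, p, q, seed=0):
    """Generate a Garnet-style MDP G(n_S, n_A, p, q): each state-action pair
    reaches p random next states, rewards are state-dependent, and each state
    has a random q-dimensional feature vector with unit norm (Assumption 2)."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_s, n_a, n_s))                         # P[s, a, s']
    for s in range(n_s):
        for a in range(n_a):
            nxt = rng.choice(n_s, size=p, replace=False)  # p reachable states
            P[s, a, nxt] = rng.dirichlet(np.ones(p))      # random branching probs
    reward = rng.uniform(size=n_s)                        # state-dependent reward
    phi = rng.normal(size=(n_s, q))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)     # ||phi(s)||_2 = 1
    return P, reward, phi

# Small instance for illustration (the paper uses G(500, 20, 50, 20)).
P, reward, phi = make_garnet(n_s=50, n_a=5, p=10, q=8)
print(P.shape, reward.shape, phi.shape)
```

Each row P[s, a, :] sums to one by construction, so the generated object is a valid transition kernel on which behavior/target policy pairs can be run.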
It can be seen that in all cases, TDC iterations with the same slow time-scale stepsize parameter σ share similar error decay rates (see the left plots), and the differences among the fast time-scale parameters ν are reflected in the behavior of the error convergence tails (see the right plots). We observe that ν = 2σ/3 yields the best error decay rate. This corroborates Theorem 1, which illustrates that the fast time-scale stepsize β_t with parameter ν affects only the tracking error term in (3), which dominates the error decay rate asymptotically.

(a) σ = 0.15 (left: full; right: tail)

(b) σ = 0.3 (left: full; right: tail)

(c) σ = 0.45 (left: full; right: tail)

(d) σ = 0.6 (left: full; right: tail)

Figure 1: Comparison among diminishing stepsize settings. For the settings σ = 0.45 and σ = 0.6, the case ν : σ = 1 : 3 has much larger training error than the others and is not included in the tail figures.

4.2 Constant Stepsize vs. Diminishing Stepsize

In this subsection, we compare the error decay of TDC under diminishing stepsize with that of TDC under four different constant stepsizes. For diminishing stepsize, we set c_α = c_β and σ = 3ν/2, and tune their values to the best, which are given by c_α = c_β = 1.8 and σ = 3ν/2 = 0.45. For the four constant-stepsize cases, we fix α for each case and tune β to the best. The resulting parameter settings are respectively as follows: α_t = 0.01, β_t = 0.006; α_t = 0.02, β_t = 0.008; α_t = 0.05, β_t = 0.02; and α_t = 0.1, β_t = 0.02. The results are reported in Figure 2, in which for both the training and tracking errors, the left plot illustrates the overall iteration process and the right plot illustrates the corresponding zoomed error tails.
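The diminishing stepsize settings compared above can be written out concretely. The polynomial decay form α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν used below is an assumption here; the precise form is specified with Theorem 1 in the earlier sections of the paper.

```python
def diminishing_stepsizes(c_alpha, c_beta, sigma, nu, T):
    """Two time-scale polynomially decaying stepsizes (assumed form:
    alpha_t = c_alpha / (1 + t)^sigma, beta_t = c_beta / (1 + t)^nu).

    Section 4.1 fixes sigma per setting and compares the decay rates
    nu in {sigma/3, sigma/2, 5*sigma/9, 2*sigma/3, 5*sigma/6, sigma},
    with nu = 2*sigma/3 performing best empirically.
    """
    alphas = [c_alpha / (1 + t) ** sigma for t in range(T)]
    betas = [c_beta / (1 + t) ** nu for t in range(T)]
    return alphas, betas
```

Since ν < σ in the best-performing settings, β_t decays more slowly than α_t, so the fast time scale stays ahead of the slow one, which is what keeps the tracking error small.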
The results suggest that although the larger constant stepsizes (α_t = 0.05, β_t = 0.02 and α_t = 0.1, β_t = 0.02) yield initially faster convergence than diminishing stepsize, they eventually oscillate in a large neighborhood of θ* due to the large tracking error. The smaller constant stepsizes (α_t = 0.02, β_t = 0.008 and α_t = 0.01, β_t = 0.006) can attain almost the same asymptotic accuracy as diminishing stepsize, but converge very slowly. We can also observe a strong correlation between the training and tracking errors under constant stepsize, i.e., larger training error corresponds to larger tracking error, which corroborates Theorem 2 and suggests that the accuracy of TDC heavily depends on the decay of the tracking error ‖z_t‖₂.

(a) Training error (left: full; right: tail)

(b) Tracking error (left: full; right: tail)

Figure 2: Comparison between TDC updates under constant stepsizes and diminishing stepsize.
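For reference, one iteration of the off-policy TDC update with linear function approximation that these experiments run can be sketched as below. This follows the standard formulation of Sutton et al. [26]; the placement of the importance weight ρ is an assumption, and this is a sketch rather than the paper's exact pseudocode.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One two time-scale TDC update (sketch, after Sutton et al. [26]).

    theta: value-function parameters (slow time scale, stepsize alpha);
    w: gradient-correction parameters (fast time scale, stepsize beta);
    phi, phi_next: feature vectors of current and next states;
    rho: importance weight of the behavior-to-target policy ratio.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    w_new = w + beta * rho * (delta - phi @ w) * phi
    return theta_new, w_new
```

The second term in the θ update is the gradient correction that distinguishes TDC from plain off-policy TD, and w tracks a solution of a linear system on the faster time scale, which is why the ratio β_t/α_t governs the tracking error studied above.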
4.3 Blockwise Diminishing Stepsize

In this subsection, we compare the error decay of TDC under blockwise diminishing stepsize with that of TDC under diminishing stepsize and constant stepsize. We use the best-tuned parameter settings listed in Section 4.2 for the latter two algorithms, i.e., c_α = c_β = 1.8 and σ = 3ν/2 = 0.45 for diminishing stepsize, and α_t = 0.1, β_t = 0.02 for constant stepsize. We report our results in Figure 3. It can be seen that TDC under blockwise diminishing stepsize converges faster than under diminishing stepsize and almost as fast as under constant stepsize. Furthermore, TDC under blockwise diminishing stepsize also has training error comparable to that under diminishing stepsize. Since the stepsize decreases geometrically across blocks, the algorithm approaches a very small neighborhood of θ* in the later blocks. We can also observe that the tracking error under blockwise diminishing stepsize decreases rapidly across blocks.

(a) Training error (left: full; right: tail)

(b) Tracking error (left: full; right: tail)

Figure 3: Comparison between TDC updates under blockwise diminishing stepsize, diminishing stepsize, and constant stepsize.

4.4 Robustness to Blocksize

In this subsection, we investigate the robustness of TDC under blockwise diminishing stepsize with respect to the blocksize. We consider the same setting as in Section 4.3, and perturb all blocksizes by certain percentages of the original blocksize suggested by the algorithm.
It can be seen from Figure 4 that the error decay rate changes only very slightly, even with a substantial change in the blocksize.

Figure 4: Comparison between TDC updates under blockwise diminishing stepsize with different blocksizes.

5 Conclusion

In this work, we provided the first non-asymptotic analysis for the two time-scale TDC algorithm over a Markovian sample path. We developed a novel technique to handle the accumulated tracking error caused by the two time-scale update, using which we characterized the non-asymptotic convergence rate under general diminishing stepsize and constant stepsize. We also proposed a blockwise diminishing stepsize scheme for TDC and proved its convergence. Our experiments demonstrated the performance advantage of such an algorithm over both the diminishing and constant stepsize TDC algorithms. Our technique for non-asymptotic analysis of two time-scale algorithms can be applied to studying other off-policy algorithms such as actor-critic [18] and gradient Q-learning algorithms [19].

Acknowledgment

The work of T. Xu and Y. Liang was supported in part by the U.S. National Science Foundation under grants CCF-1761506, ECCS-1818904, and CCF-1801855.

References

[1] T. Archibald, K. McKinnon, and L. Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3):354–361, 1995.

[2] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings, pages 30–37.
Morgan Kaufmann, 1995.

[3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Proc. Conference on Learning Theory (COLT), pages 1691–1692, 2018.

[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009.

[5] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

[6] V. S. Borkar and S. Pattathil. Concentration bounds for two time scale stochastic approximation. In Proc. Allerton Conference on Communication, Control, and Computing (Allerton), pages 504–511, 2018.

[7] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence, 2018.

[8] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference on Learning Theory (COLT), 2018.

[9] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.

[10] H. Gupta, R. Srikant, and L. Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.

[11] B. Hu and U. A. Syed. Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.

[12] S. Kamal. On the convergence, lock-in probability and sample complexity of stochastic approximation.
SIAM Journal on Control and Optimization, 48(8):5178–5192, 2010.

[13] P. Karmakar and S. Bhatnagar. Dynamics of stochastic approximation with Markov iterate-dependent noise with the stability of the iterates not ensured. arXiv preprint arXiv:1601.02217, 2016.

[14] P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151, 2017.

[15] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.

[16] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proc. Uncertainty in Artificial Intelligence (UAI), pages 504–513. AUAI Press, 2015.

[17] H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

[18] H. R. Maei. Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842, 2018.

[19] H. R. Maei and R. S. Sutton. GQ(lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proc. Artificial General Intelligence (AGI). Atlantis Press, 2010.

[20] A. Mokkadem and M. Pelletier. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability, 16(3):1671–1702, 2006.

[21] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923, 2019.

[22] A. Ramaswamy and S. Bhatnagar. Stability of stochastic approximations with "controlled Markov" noise and temporal difference learning. IEEE Transactions on Automatic Control, 64(6):2614–2620, 2018.

[23] D. Silver, G. Lever, N. Heess, T.
Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proc. International Conference on Machine Learning (ICML), 2014.

[24] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[25] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[26] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning (ICML), pages 993–1000, 2009.

[27] R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), 21(21):1609–1616, 2008.

[28] V. Tadić. On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42(3):241–267, 2001.

[29] V. B. Tadić. Almost sure convergence of two time-scale stochastic approximation algorithms. In Proc. American Control Conference, volume 4, pages 3802–3807, 2004.

[30] G. Thoppe and V. Borkar. A concentration bound for stochastic approximation via Alekseev's formula. Stochastic Systems, 9(1):1–26, 2019.

[31] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1075–1081, 1997.

[32] Y. Wang, W. Chen, Y. Liu, Z.-M. Ma, and T.-Y. Liu. Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 5504–5513, 2017.

[33] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[34] V. Yaji and S.
Bhatnagar. Stochastic recursive inclusions in two timescales with non-additive iterate dependent Markov noise. arXiv preprint arXiv:1611.05961, 2016.

[35] T. Yang, Y. Yan, Z. Yuan, and R. Jin. Why does stagewise training accelerate convergence of testing error over SGD? arXiv preprint arXiv:1812.03934, 2018.

[36] H. Yu. On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652, 2017.

[37] S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.