{"title": "Characterizing the Exact Behaviors of Temporal Difference Learning Algorithms Using Markov Jump Linear System Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 8479, "page_last": 8490, "abstract": "In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first and second order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of some augmented versions of the mean and covariance matrix of the TD estimation error exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying the well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of the TD estimation error at any time step. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of the TD estimation error, and perform a perturbation analysis to characterize the dependence of the TD behaviors on learning rate. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of the TD estimation error converge to the steady state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by learning rate and the underlying Markov chain. For both cases, upper and lower bounds for the mean square TD error are provided. 
The mean square TD error is shown to converge linearly to an exact limit.", "full_text": "Characterizing the Exact Behaviors of Temporal Difference Learning Algorithms Using Markov Jump Linear System Theory

Bin Hu, Usman Ahmed Syed
Department of Electrical and Computer Engineering
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign

Abstract

In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first and second order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of some augmented versions of the mean and covariance matrix of the TD estimation error exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying the well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of the TD estimation error at any time step. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of the TD estimation error, and perform a perturbation analysis to characterize the dependence of the TD behaviors on the learning rate. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of the TD estimation error converge to the steady-state values at a linear rate.
For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by the learning rate and the underlying Markov chain. For both cases, upper and lower bounds for the mean square TD error are derived. An exact formula for the steady-state mean square TD error is also provided.

1 Introduction

Reinforcement learning (RL) has shown great promise in solving sequential decision making tasks [5, 48]. One important topic in RL is policy evaluation, whose objective is to evaluate the value function of a given policy. A large family of temporal difference (TD) learning methods, including standard TD, GTD, TDC, GTD2, DTD, and ATD [47, 50, 49, 38], has been developed to solve the policy evaluation problem. These TD learning algorithms have become important building blocks for RL algorithms. See [17] for a comprehensive survey. Despite the popularity of TD learning, the behaviors of these algorithms have not been fully understood from a theoretical viewpoint. The standard ODE technique [51, 9, 7, 36, 8] can only be used to prove asymptotic convergence. Finite sample bounds are challenging to obtain and are typically developed in a case-by-case manner. Recently, there have been intensive research activities focusing on establishing finite sample bounds for TD learning methods with linear function approximation under various assumptions. The IID noise case is covered in [16, 37, 41]. In [6], the analysis is extended to a Markov noise model, but an extra projection step in the algorithm is required. Very recently, finite sample bounds for the TD method (without the projection step) under the Markov assumption have been obtained in [45]. The bounds in [45] actually work for any TD learning algorithm that can be modeled by a linear stochastic approximation scheme.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
It remains unclear how tight these bounds are (especially in the large learning rate regime). To complement the existing analysis results and techniques, we propose a general unified analysis framework for TD learning algorithms by borrowing the Markov jump linear system (MJLS) theory [14] from the controls literature. Our approach is inspired by a recent research trend of applying control theory to the analysis of optimization algorithms [39, 30, 31, 29, 21, 52, 15, 46, 28, 32, 22, 3, 40, 26, 4, 18, 43], and extends the jump system perspective for finite sum optimization methods in [31] to TD learning.

Our key insight is that TD learning algorithms with linear function approximation are essentially Markov jump linear systems. Notice that an MJLS is described by a linear state-space model whose state/input matrices are functions of a jump parameter sampled from a finite-state Markov chain. Since the behaviors of MJLS have been well established in the controls field [14, 23, 1, 12, 13, 33, 34, 19, 20, 44], we can borrow the analysis tools there to analyze TD learning algorithms in a more unified manner. Our main contributions are summarized as follows.

1. We present a unified Markov jump linear system perspective on a large family of TD learning algorithms including TD, TDC, GTD, GTD2, ATD, and DTD. Specifically, we make the key observation that these methods are simply MJLS subject to some prescribed input.

2. By tailoring the existing MJLS theory, we show that the evolution of some augmented versions of the mean and covariance matrix of the estimation error in all of the above TD learning methods exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system for both the IID and Markov noise cases. As a result, we obtain unified closed-form formulas for the mean and covariance matrix of the TD estimation error at any time step.

3.
We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of the TD estimation error under the general Markov assumption. Using matrix perturbation theory [42, 35, 2, 24], we perform a perturbation analysis to show the dependence of the behaviors of TD learning on the learning rate in a more transparent manner. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of the TD estimation error converge to the steady-state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by the learning rate and the underlying Markov chain. For both cases, we show that the mean square error of TD learning converges linearly to a limit whose exact formula is also provided. In addition, upper and lower bounds for the mean square error of TD learning are simultaneously obtained.

We view our proposed analysis as a complement rather than a replacement for existing analysis techniques. The existing analysis focuses on upper bounds for the TD estimation error. Our closed-form formulas provide both upper and lower bounds for the mean square error of TD learning. Our analysis also characterizes the exact limit of the steady-state TD error and the related convergence rates.

2 Background

2.1 Notation

The set of m-dimensional real vectors is denoted as R^m. The Kronecker product of two matrices A and B is denoted by A ⊗ B. Notice (A ⊗ B)^T = A^T ⊗ B^T and (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) when the matrices have compatible dimensions. Let vec denote the standard vectorization operation that stacks the columns of a matrix into a vector. We have vec(AXB) = (B^T ⊗ A) vec(X). Let sym denote the symmetrization operation, i.e. sym(A) = (A + A^T)/2. Let diag(H_i) denote the block-diagonal matrix whose (i, i)-th block is H_i and all other blocks are zero. Specifically, given H_i for i = 1, . . .
, n, we have

diag(H_i) = [H_1 . . . 0 ; . . . ; 0 . . . H_n],

where semicolons denote the block rows. A square matrix is Schur stable if all its eigenvalues have magnitude strictly less than 1. A square matrix is Hurwitz if all its eigenvalues have strictly negative real parts. The spectral radius of a matrix H is denoted as σ(H). The eigenvalue of H with the largest magnitude is denoted as λ_max(H), and the eigenvalue of H with the largest real part is denoted as λ_max real(H).

2.2 Linear time-invariant systems

A linear time-invariant (LTI) system is typically governed by the following state-space model:

x_{k+1} = H x_k + G u_k,    (1)

where x_k ∈ R^{n_x}, u_k ∈ R^{n_u}, H ∈ R^{n_x × n_x}, and G ∈ R^{n_x × n_u}. The LTI system theory is well documented in standard control textbooks [27, 10]. Here we briefly review several useful results.

• Closed-form formula for x_k: Given an initial condition x_0 and an input sequence {u_k}, the sequence {x_k} can be determined using the following closed-form expression:

x_k = (H)^k x_0 + Σ_{t=0}^{k-1} (H)^{k-1-t} G u_t,    (2)

where (H)^k stands for the k-th power of the matrix H.

• Necessary and sufficient stability condition: When H is Schur stable, we know (H)^k x_0 → 0 for any arbitrary x_0. When σ(H) ≥ 1, there always exists x_0 such that (H)^k x_0 does not converge to 0. When σ(H) > 1, there even exists x_0 such that (H)^k x_0 → ∞. See Section 7.2 in [27] for a detailed discussion. A well-known result in the controls literature is that the LTI system (1) is stable if and only if H is Schur stable.

• Exact limit for x_k: If H is Schur stable and u_k converges to a limit u_∞, then x_k will converge to an exact limit. This is formalized as follows.

Proposition 1. Consider the LTI system (1).
If σ(H) < 1 and lim_{k→∞} u_k = u_∞, then lim_{k→∞} x_k exists and we have x_∞ = lim_{k→∞} x_k = (I − H)^{-1} G u_∞.

• Response to a constant input: If u_k = u for all k and σ(H) < 1, then the closed-form expression for x_k can be further simplified to give the following tight convergence rate result.

Proposition 2. Suppose σ(H) < 1 and x_k is determined by (1). If u_k = u for all k, then x_k converges to the limit point x_∞ = lim_{k→∞} x_k = (I − H)^{-1} G u, and we can compute x_k as

x_k = x_∞ + (H)^k (x_0 − x_∞).    (3)

In addition, ‖x_k − x_∞‖ ≤ C_0 (σ(H) + ε)^k for some C_0 and any arbitrarily small ε > 0. From the above proposition, we can clearly see that x_k is the sum of a constant steady-state term x_∞ and a matrix power term that decays at a linear rate specified by σ(H) (see Section 2.2 in [39] for more explanations). The convergence rate characterized by (σ(H) + ε) is tight. More discussion on the tightness of this convergence rate is provided in the supplementary material.

• Response to an exponentially shrinking input: When u_k itself converges at a linear rate ρ̃ and H is Schur stable, x_k will converge to its limit point at a linear rate specified by max{σ(H) + ε, ρ̃}. A formal statement is provided as follows.

Proposition 3. Suppose σ(H) < 1 and x_k is determined by (1). If u_k converges to u_∞ as ‖u_k − u_∞‖ ≤ C ρ̃^k, then we have x_∞ = lim_{k→∞} x_k = (I − H)^{-1} G u_∞ and ‖x_k − x_∞‖ ≤ C_0 (max{σ(H) + ε, ρ̃})^k for some C_0 and any arbitrarily small ε > 0.

The results in Propositions 1, 2, and 3 are well known in the control community.
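As a quick numerical illustration of Propositions 1 and 2, the following sketch (with an arbitrarily chosen Schur-stable H, G, and constant input, all made up for this example) iterates (1) and checks the result against the closed-form expression (3):

```python
import numpy as np

# Toy Schur-stable system, chosen only for illustration.
H = np.array([[0.5, 0.2],
              [0.0, 0.3]])   # spectral radius 0.5 < 1
G = np.array([[1.0],
              [1.0]])
u = np.array([2.0])          # constant input u_k = u
x0 = np.array([5.0, -3.0])

x_inf = np.linalg.solve(np.eye(2) - H, G @ u)   # x_inf = (I - H)^{-1} G u

x = x0.copy()
for k in range(50):
    x = H @ x + G @ u        # LTI update (1): x_{k+1} = H x_k + G u_k

# Closed-form expression (3): x_k = x_inf + H^k (x_0 - x_inf)
x_closed = x_inf + np.linalg.matrix_power(H, 50) @ (x0 - x_inf)
print(np.allclose(x, x_closed), np.allclose(x, x_inf))  # → True True
```

The matrix power term has decayed to numerical noise after 50 steps, so the iterate, the closed form, and the steady-state limit all agree.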
For completeness, we include their proofs in the supplementary material.

2.3 Markov jump linear systems

Another important class of dynamical systems that has been extensively studied in the controls literature is the so-called Markov jump linear system (MJLS) [14]. Let {z_k} be a Markov chain sampled from a finite state space S. An MJLS is governed by the following state-space model:

ξ_{k+1} = H(z_k) ξ_k + G(z_k) y_k,    (4)

where H(z_k) and G(z_k) are matrix functions of z_k. Here, ξ_k is the state, and y_k is the input. There is a one-to-one mapping from S to the set N := {1, 2, . . . , n}, where n = |S|. We can assume H(z_k) is sampled from a set of matrices {H_1, H_2, . . . , H_n} and G(z_k) is sampled from {G_1, G_2, . . . , G_n}. We have H(z_k) = H_i and G(z_k) = G_i when z_k = i. The MJLS theory has been well developed in the controls community [14]. We will apply the MJLS theory to analyze TD learning algorithms.

3 A general Markov jump system perspective for TD learning

In this section, we provide a general jump system perspective for TD learning with linear function approximation. Notice that many TD learning algorithms, including TD, TDC, GTD, GTD2, A-TD, and D-TD, can be modeled by the following linear stochastic recursion:

ξ_{k+1} = ξ_k + α (A(z_k) ξ_k + b(z_k)),    (5)

where {z_k} forms a finite-state Markov chain and b(z_k) satisfies lim_{k→∞} E b(z_k) = 0.[1] We have A(z_k) = A_i and b(z_k) = b_i when z_k = i. For simplicity, we mainly focus on analyzing (5). Other models, including two time-scale schemes [25, 54], are discussed in the supplementary material. Our key observation is that (5) can be rewritten as the following MJLS:

ξ_{k+1} = (I + α A(z_k)) ξ_k + α b(z_k).    (6)

The above model is a special case of (4) if we set H(z_k) = I + α A(z_k), G(z_k) = α b(z_k), and y_k = 1 for all k.
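A minimal simulation of the general MJLS model (4) may help fix ideas; the two-state chain and the mode-dependent matrices H_i, G_i below are made up purely to show the mechanics of the jump parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain for the jump parameter z_k (toy transition matrix).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Mode-dependent matrices H_i and G_i, arbitrary values for illustration.
H = [np.array([[0.7, 0.1], [0.0, 0.6]]),
     np.array([[0.5, -0.2], [0.1, 0.4]])]
G = [np.array([0.1, 0.0]),
     np.array([0.0, 0.1])]

z = 0
xi = np.array([1.0, -1.0])
for k in range(200):
    xi = H[z] @ xi + G[z]        # MJLS update (4) with constant input y_k = 1
    z = rng.choice(2, p=P[z])    # advance the jump parameter along the chain
```

Each step applies the matrices selected by the current chain state, which is exactly the structure (6) gives to the TD recursion.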
Consequently, many TD learning algorithms can be analyzed using the MJLS theory. We will borrow the analysis ideas from the standard MJLS theory. Our analysis is built upon the fact that some augmented versions of the mean and the covariance matrix of {ξ_k} for the MJLS model (4) actually follow the dynamics of a deterministic LTI model in the form of (1) [14, Chapter 3]. To see this, we denote the transition probabilities for the Markov chain {z_k} as p_ij := P(z_{k+1} = j | z_k = i) and specify the transition matrix P by setting its (i, j)-th entry to be p_ij. Obviously, we have p_ij ≥ 0 and Σ_{j=1}^n p_ij = 1 for all i. Next, the indicator function 1_{z_k=i} is defined as 1_{z_k=i} = 1 if z_k = i and 1_{z_k=i} = 0 otherwise. Now we define the following key quantities:

q_i^k = E(ξ_k 1_{z_k=i}),    Q_i^k = E(ξ_k (ξ_k)^T 1_{z_k=i}).

Suppose y_k = 1 for all k. Based on [14, Proposition 3.35], q_i^k and Q_i^k can be iteratively calculated as

q_j^{k+1} = Σ_{i=1}^n p_ij (H_i q_i^k + G_i p_i^k),    (7)

Q_j^{k+1} = Σ_{i=1}^n p_ij (H_i Q_i^k H_i^T + 2 sym(H_i q_i^k G_i^T) + p_i^k G_i G_i^T),    (8)

where p_i^k := P(z_k = i). If we further augment q_i^k and Q_i^k as

q^k = [q_1^k ; . . . ; q_n^k],    Q^k = [Q_1^k . . . Q_n^k],

where semicolons denote vertical stacking, then it is straightforward to rewrite (7) and (8) as the following LTI system:

[q^{k+1} ; vec(Q^{k+1})] = [H11 0 ; H21 H22] [q^k ; vec(Q^k)] + [u_q^k ; u_Q^k],    (9)

where H11, H21, H22, u_q^k, and u_Q^k are given by

H11 = [p_11 H_1 . . . p_n1 H_n ; . . . ; p_1n H_1 . . . p_nn H_n],

H22 = [p_11 H_1 ⊗ H_1 . . . p_n1 H_n ⊗ H_n ; . . . ; p_1n H_1 ⊗ H_1 . . . p_nn H_n ⊗ H_n],

H21 = [p_11 (H_1 ⊗ G_1 + G_1 ⊗ H_1) . . . p_n1 (H_n ⊗ G_n + G_n ⊗ H_n) ; . . . ; p_1n (H_1 ⊗ G_1 + G_1 ⊗ H_1) . . . p_nn (H_n ⊗ G_n + G_n ⊗ H_n)],

u_q^k = [p_11 G_1 . . . p_n1 G_n ; . . . ; p_1n G_1 . . . p_nn G_n] [p_1^k I ; . . . ; p_n^k I],

u_Q^k = [p_11 G_1 ⊗ G_1 . . . p_n1 G_n ⊗ G_n ; . . . ; p_1n G_1 ⊗ G_1 . . . p_nn G_n ⊗ G_n] [p_1^k I ; . . . ; p_n^k I].    (10)

[1] This standard assumption is typically related to the projected Bellman equation and can always be enforced by a shifting argument. More explanations are provided in Remark 1.

A detailed derivation of the above result is presented in the supplementary material. A key implication here is that q^k and vec(Q^k) follow the LTI dynamics (9) and can be analyzed using the standard LTI theory reviewed in Section 2.2. Obviously, we have E ξ_k = Σ_{i=1}^n q_i^k, E(ξ_k (ξ_k)^T) = Σ_{i=1}^n Q_i^k, and E ‖ξ_k‖^2 = trace(Σ_{i=1}^n Q_i^k) = (1_n^T ⊗ vec(I_{n_ξ})^T) vec(Q^k). Hence the mean, covariance, and mean square norm of ξ_k can all be calculated using closed-form expressions. We will present a detailed analysis of (6) and provide the related implications for TD learning in the next two sections. For illustrative purposes, we explain the jump system perspective for the standard TD method.

Example 1: TD method.
The standard TD method (or TD(0)) uses the following update rule:

θ_{k+1} = θ_k − α φ(s_k) ((φ(s_k) − γ φ(s_{k+1}))^T θ_k − r(s_k)),    (11)

where {s_k} is the underlying Markov chain, φ is the feature vector, r is the reward, γ is the discount factor, and θ_k is the weight vector to be estimated. Suppose θ* is the vector that solves the projected Bellman equation. We can set z_k = [(s_{k+1})^T (s_k)^T]^T and then rewrite the TD update as

θ_{k+1} − θ* = (I + α A(z_k)) (θ_k − θ*) + α b(z_k),    (12)

where A(z_k) = φ(s_k) (γ φ(s_{k+1}) − φ(s_k))^T and b(z_k) = φ(s_k) (r(s_k) − (φ(s_k) − γ φ(s_{k+1}))^T θ*). Suppose lim_{k→∞} p_i^k = p_i^∞. Since the projected Bellman equation and the equation Σ_{i=1}^n p_i^∞ b_i = 0 are actually equivalent, we have naturally enforced lim_{k→∞} E b(z_k) = 0. Therefore, the TD update can be modeled as (6) with b(z_k) satisfying lim_{k→∞} E b(z_k) = 0. See Section 3.1 in [45] for a similar formulation. Now we can apply the MJLS theory and the LTI model (9) to analyze the covariance E((θ_k − θ*)(θ_k − θ*)^T) and the mean square error E ‖θ_k − θ*‖^2.
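To make the update rule (11) concrete, here is a minimal TD(0) sketch on a made-up three-state Markov reward process with two-dimensional features (all numbers are illustrative, not from the paper); with a constant step size the iterate settles near the projected-Bellman fixed point θ*:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-state chain (a deterministic cycle), rewards, and features -- all made up.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, -1.0])
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])          # feature vectors phi(s) as rows
gamma, alpha = 0.9, 0.05

theta = np.zeros(2)
s = 0
for k in range(20000):
    s_next = rng.choice(3, p=P[s])
    # TD(0) update (11):
    theta -= alpha * Phi[s] * ((Phi[s] - gamma * Phi[s_next]) @ theta - r[s])
    s = s_next
```

For this toy chain, Ā = Σ_i p_i^∞ φ(s)(γφ(s') − φ(s))^T works out to be Hurwitz, so a sufficiently small constant α keeps the iteration stable, in line with Remark 1 below; the residual fluctuation around θ* is on the order of α.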
In this case, we have

q^k = [E((θ_k − θ*) 1_{z_k=1}) ; . . . ; E((θ_k − θ*) 1_{z_k=n})],

vec(Q^k) = [vec(E((θ_k − θ*)(θ_k − θ*)^T 1_{z_k=1})) ; . . . ; vec(E((θ_k − θ*)(θ_k − θ*)^T 1_{z_k=n}))].

Then we can easily analyze q^k and Q^k by applying the LTI model (9). In general, the covariance matrix E((θ_k − θ*)(θ_k − θ*)^T) and the mean value E(θ_k − θ*) do not directly follow an LTI system. However, when working with the augmented covariance matrix Q^k and the augmented mean vector q^k, we do obtain an LTI model in the form of (1). Once the closed-form expression for Q^k is obtained, the mean square estimation error for the TD update can be immediately calculated as

E ‖θ_k − θ*‖^2 = trace(Σ_{i=1}^n Q_i^k) = (1_n^T ⊗ vec(I_{n_θ})^T) vec(Q^k).

Here we omit the detailed formulations for other TD learning methods. The key message is that {z_k} can be viewed as a jump parameter and TD learning methods are essentially MJLS. Notice that all the TD learning algorithms that can be analyzed using the ODE method are in the form of (6). Jump system perspectives for other TD learning algorithms are discussed in the supplementary material.

Remark 1 (Assumptions). Denote Ā = lim_{k→∞} E A(z_k) = Σ_{i=1}^n p_i^∞ A_i. In this paper, we will assume Ā is Hurwitz. This assumption is standard and is even required by the ODE approach. For the standard TD method, Ā is Hurwitz when the discount factor is smaller than 1, p_i^∞ is positive for all i, and the feature matrix has full column rank [51]. It is worth emphasizing that the assumption lim_{k→∞} E b(z_k) = 0 is also general. Suppose Σ_{i=1}^n p_i^∞ b_i ≠ 0. This case can still be handled using a shifting argument since Ā is Hurwitz.
Notice that the iteration ξ_{k+1} = (I + α A(z_k)) ξ_k + α b(z_k) can be rewritten as ξ_{k+1} − ξ̃ = ξ_k − ξ̃ + α (A(z_k)(ξ_k − ξ̃) + A(z_k) ξ̃ + b(z_k)) for any ξ̃. Now we denote b̃_i = A_i ξ̃ + b_i, and the above iteration just becomes ξ_{k+1} − ξ̃ = (I + α A(z_k))(ξ_k − ξ̃) + α b̃(z_k). When Ā is Hurwitz (and hence invertible), we can choose ξ̃ = −(Σ_{i=1}^n p_i^∞ A_i)^{-1} (Σ_{i=1}^n p_i^∞ b_i) such that Σ_{i=1}^n p_i^∞ b̃_i = Σ_{i=1}^n p_i^∞ (A_i ξ̃ + b_i) = 0.

Remark 2 (Generality of (4)). Notice that (4) provides a general jump system model for linear stochastic schemes that may have more complicated forms than (5). However, (4) cannot be directly used to cover nonlinear stochastic approximation schemes. See [53, 11] for recent finite sample analysis results on nonlinear stochastic approximation over non-IID data.

4 Analysis under the IID assumption

For illustrative purposes, we first present the analysis of (6) under the IID assumption (P(z_k = i) = p_i for all i). In this case, the analysis is significantly simpler, since {E ξ_k} and {E(ξ_k (ξ_k)^T)} directly form LTI systems with much smaller dimensions.
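This claim can be sanity-checked numerically. For a scalar two-state IID example (all numbers made up), the exact first and second moments obtained by enumerating every sample path agree with a deterministic two-dimensional recursion for (E ξ_k, E ξ_k^2):

```python
import itertools
import numpy as np

# Scalar two-state IID example: xi_{k+1} = (1 + alpha*a_z) xi_k + alpha*b_z.
alpha = 0.1
a = np.array([-1.0, -2.0])        # A_i (scalars)
b = np.array([1.0, -1.0])         # note sum_i p_i b_i = 0
p = np.array([0.5, 0.5])          # IID distribution
H = 1.0 + alpha * a               # H_i = 1 + alpha*A_i
G = alpha * b                     # G_i = alpha*b_i
xi0, k = 2.0, 8

# Exact moments by brute force: enumerate all 2^k IID sequences z_0 .. z_{k-1}.
mean_bf, second_bf = 0.0, 0.0
for zs in itertools.product([0, 1], repeat=k):
    xi, prob = xi0, 1.0
    for z in zs:
        xi = H[z] * xi + G[z]
        prob *= p[z]
    mean_bf += prob * xi
    second_bf += prob * xi**2

# Deterministic moment recursion (scalar version of the LTI moment dynamics).
mu, Q = xi0, xi0**2
for _ in range(k):
    mu, Q = (p @ (H * mu + G),
             p @ H**2 * Q + 2 * p @ (H * G) * mu + p @ G**2)
print(abs(mu - mean_bf) < 1e-12, abs(Q - second_bf) < 1e-12)  # → True True
```

Both computations are exact (up to floating point), which is precisely why no Monte Carlo averaging is needed: the moments themselves follow a deterministic linear system.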
We denote μ_k := E ξ_k and Q_k := E(ξ_k (ξ_k)^T). Then the following equations hold for the general jump system model (4):

μ_{k+1} = Σ_{i=1}^n p_i (H_i μ_k + G_i) = H̄ μ_k + Ḡ,

vec(Q_{k+1}) = (Σ_{i=1}^n p_i H_i ⊗ H_i) vec(Q_k) + (Σ_{i=1}^n p_i (H_i ⊗ G_i + G_i ⊗ H_i)) μ_k + Σ_{i=1}^n p_i G_i ⊗ G_i.    (13)

There are many ways to derive the above formulas. One way is to first show q_i^k = p_i μ_k and Q_i^k = p_i Q_k in this case and then apply (7) and (8). Another way is to directly modify the proof of Theorem 1 (which is presented in the supplementary material). Now consider the jump system model (6) under the assumption E b(z_k) = Σ_{i=1}^n p_i b_i = 0. In this case, we have H_i = I + α A_i, G_i = α b_i, and y_k = 1. Denote Ā := Σ_{i=1}^n p_i A_i. We can directly obtain the following result.

Theorem 1. Consider the jump system model (6) with H_i = I + α A_i, G_i = α b_i, and y_k = 1. Suppose {z_k} is sampled from N using an IID distribution P(z_k = i) = p_i. In addition, assume Σ_{i=1}^n p_i b_i = 0. Then μ_k and vec(Q_k) are governed by the following LTI system (semicolons denote vertical stacking):

[μ_{k+1} ; vec(Q_{k+1})] = [H11 0 ; H21 H22] [μ_k ; vec(Q_k)] + [0 ; α^2 Σ_{i=1}^n p_i (b_i ⊗ b_i)],    (14)

where H11, H21, and H22 are determined as

H11 = I + α Ā,
H21 = α^2 Σ_{i=1}^n p_i (A_i ⊗ b_i + b_i ⊗ A_i),
H22 = I_{n_ξ^2} + α (I ⊗ Ā + Ā ⊗ I) + α^2 Σ_{i=1}^n p_i (A_i ⊗ A_i).    (15)

In addition, if σ(H22) < 1, we have

[μ_k ; vec(Q_k)] = [μ_∞ ; vec(Q_∞)] + ([H11 0 ; H21 H22])^k ([μ_0 ; vec(Q_0)] − [μ_∞ ; vec(Q_∞)]),    (16)

where μ_∞ = lim_{k→∞} μ_k = 0, and vec(Q_∞) is given as

vec(Q_∞) = lim_{k→∞} vec(Q_k) = −α (I ⊗ Ā + Ā ⊗ I + α Σ_{i=1}^n p_i (A_i ⊗ A_i))^{-1} (Σ_{i=1}^n p_i (b_i ⊗ b_i)).    (17)

Proof. For completeness, a detailed proof is presented in the supplementary material.

Now we discuss the implications of the above theorem for TD learning. For simplicity, we denote

H = [H11 0 ; H21 H22].

Stability condition for TD learning. From the LTI theory, the system (14) is stable if and only if H is Schur stable. We can apply Proposition 3.6 in [14] to show that H is Schur stable if and only if H22 is Schur stable. Hence, a necessary and sufficient stability condition for the LTI system (14) is that H22 is Schur stable. Under this condition, the first term on the right side of (16) converges to 0 at a linear rate specified by σ(H), and the second term on the right side of (16) is a constant matrix quantifying the steady-state covariance. An important question for TD learning is how to choose α such that σ(H22) < 1 for some given {A_i}, {b_i}, and {p_i}. We provide some clue to this question by applying an eigenvalue perturbation analysis to the matrix H22. We assume α is small. Then under a mild technical condition[2], we can ignore the quadratic term α^2 Σ_{i=1}^n p_i (A_i ⊗ A_i) in the expression of H22 and use λ_max(I_{n_ξ^2} + α (I ⊗ Ā + Ā ⊗ I)) to estimate λ_max(H22). We have

λ_max(H22) = 1 + 2 λ_max real(Ā) α + O(α^2).    (18)

Then we immediately obtain σ(H22) ≈ 1 + 2 real(λ_max real(Ā)) α + O(α^2).
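The stability condition and the perturbation estimate (18) can be checked numerically on a small made-up example. The diagonal A_i below are chosen so that Ā is Hurwitz and the relevant eigenvalue is semisimple; the b_i do not enter H22:

```python
import numpy as np

alpha = 0.01
p = np.array([0.5, 0.5])
A = [np.diag([-1.0, -2.0]), np.diag([-2.0, -1.0])]   # made-up Hurwitz example
A_bar = sum(pi * Ai for pi, Ai in zip(p, A))          # = -1.5 * I
I2, I4 = np.eye(2), np.eye(4)

# H22 from (15): I + alpha*(I (x) Abar + Abar (x) I) + alpha^2 * sum_i p_i (A_i (x) A_i)
H22 = (I4 + alpha * (np.kron(I2, A_bar) + np.kron(A_bar, I2))
           + alpha**2 * sum(pi * np.kron(Ai, Ai) for pi, Ai in zip(p, A)))

rho = max(abs(np.linalg.eigvals(H22)))
lam = max(np.linalg.eigvals(A_bar).real)              # lambda_max real(A_bar) = -1.5
print(rho < 1, abs(rho - (1 + 2 * lam * alpha)) < 3 * alpha**2)  # → True True
```

For this example σ(H22) = 1 − 3α + O(α^2) < 1, matching (18): the spectral radius dips below 1 at a rate governed by the real part of the dominant eigenvalue of Ā.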
Therefore, as long as Ā is Hurwitz, there exists a sufficiently small α such that σ(H22) < 1. More details of the perturbation analysis are provided in the supplementary material.

Exact limit for the mean square error of TD learning. Obviously, μ_k converges to 0 at the rate specified by σ(I + α Ā) due to the relation μ_k = (I + α Ā)^k μ_0. Applying Proposition 3 and making use of the block structure in H, one can show vec(Q_∞) = α^2 (I_{n_ξ^2} − H22)^{-1} (Σ_{i=1}^n p_i (b_i ⊗ b_i)), which leads to the result in (17). A key message here is that the covariance matrix converges linearly to an exact limit under the stability condition σ(H22) < 1. We can clearly see that lim_{k→∞} vec(Q_k) = O(α) and can be controlled by decreasing α. When α is large, we need to keep the quadratic term α^2 Σ_{i=1}^n p_i (A_i ⊗ A_i). Therefore, our theory captures the steady-state behavior of TD learning for both small and large α, and complements the existing finite sample bounds in the literature. To further compare our results with existing finite sample bounds, we obtain the following result for the mean square error of TD learning.

Corollary 1. Consider the TD update (12) with Ā being Hurwitz. Suppose σ(H22) < 1 and P(z_k = i) = p_i for all i. Then lim_{k→∞} E ‖θ_k − θ*‖^2 exists and is determined as δ_∞ := lim_{k→∞} E ‖θ_k − θ*‖^2 = trace(Q_∞), where Q_∞ is given by (17).
In addition, the following mean square TD error bounds hold for some constant C_0 and any arbitrarily small positive ε:

δ_∞ − C_0 (σ(H) + ε)^k ≤ E ‖θ_k − θ*‖^2 ≤ δ_∞ + C_0 (σ(H) + ε)^k.    (19)

Finally, for sufficiently small α, one has lim_{k→∞} E ‖θ_k − θ*‖^2 = O(α). If λ_max(I_{n_ξ^2} + α (I ⊗ Ā + Ā ⊗ I)) is a semisimple eigenvalue, then σ(H) = σ(H11) = 1 + real(λ_max real(Ā)) α for small α.

Proof. Recall that we have E ‖θ_k − θ*‖^2 = trace(Q_k). Taking limits on both sides leads to the expression for δ_∞. Then we can apply Proposition 2 to obtain a linear convergence bound for Q_k, which eventually leads to (19). Notice that Ā is assumed to be Hurwitz. Therefore, we can apply standard matrix perturbation theory to show δ_∞ = O(α) and σ(H) = σ(H11) = 1 + real(λ_max real(Ā)) α for sufficiently small α.

The above corollary gives both upper and lower bounds for the mean square error of TD learning. From the above result, the final TD estimation error is exactly on the order of O(α). This justifies the tightness of the existing upper bounds for the final TD error up to a constant factor. From the above corollary, we can also see that one can obtain a faster convergence rate at the price of a bigger steady-state error. This is consistent with the finite sample bounds in the literature [6, 45]. Since H21 = O(α^2), it is possible to tighten the rate to σ(H22) ≈ 1 + 2 real(λ_max real(Ā)) α by allowing some extra error on the order of O(α).
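The exact limit (17) can be cross-checked against the moment recursion (13) on a scalar made-up example (n = 2, a_i = −1, b_i = ±1, p_i = 1/2), where (17) evaluates in closed form to Q_∞ = α/(2 − α), so δ_∞ ≈ α/2 = O(α):

```python
import numpy as np

alpha = 0.05
p = np.array([0.5, 0.5])
a = np.array([-1.0, -1.0])        # A_i (scalars), so A_bar = -1 is Hurwitz
b = np.array([1.0, -1.0])         # sum_i p_i b_i = 0

# Closed-form limit (17), scalar case:
# Q_inf = -alpha * (2*A_bar + alpha * sum_i p_i a_i^2)^{-1} * sum_i p_i b_i^2
A_bar = p @ a
Q_inf = -alpha * (p @ b**2) / (2 * A_bar + alpha * (p @ a**2))   # = alpha/(2-alpha)

# Iterate the moment recursion (13) until convergence:
mu, Q = 1.0, 1.0
H, G = 1 + alpha * a, alpha * b
for _ in range(5000):
    mu, Q = (p @ (H * mu + G),
             p @ H**2 * Q + 2 * p @ (H * G) * mu + p @ G**2)
print(abs(Q - Q_inf) < 1e-10, abs(Q_inf - alpha / (2 - alpha)) < 1e-12)  # → True True
```

The recursion's fixed point matches the closed-form limit exactly, and halving α roughly halves δ_∞, illustrating the O(α) steady-state error of Corollary 1.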
We omit the details for such modifications.

² One such condition is that $\lambda_{\max}(I_{n_\xi^2} + \alpha(I \otimes \bar A + \bar A \otimes I))$ is a semisimple eigenvalue.

5 Analysis under the Markov assumption

Now we analyze the behaviors of TD learning under the general assumption that $\{z_k\}$ is a Markov chain. Recall that the augmented mean vector $q^k$ and the augmented covariance matrix $Q^k$ have been defined in Section 3. We can directly modify (9) to obtain the following result.

Theorem 2. Consider the jump system model (6) with $H_i = I + \alpha A_i$, $G_i = \alpha b_i$, and $y_k = 1$. Suppose $\{z_k\}$ is a Markov chain sampled from $\mathcal{N}$ using the transition matrix $P$. In addition, define $p_i^k = \mathbb{P}(z_k = i)$ and set the augmented vector $p^k := [p_1^k \; p_2^k \; \cdots \; p_n^k]^T$. Clearly $p^k = (P^T)^k p^0$. Further denote the augmented vectors $b := [b_1^T \; b_2^T \; \cdots \; b_n^T]^T$ and $\hat B := [(b_1 \otimes b_1)^T \; \cdots \; (b_n \otimes b_n)^T]^T$, and set $S(b_i, A_i) := b_i \otimes (I + \alpha A_i) + (I + \alpha A_i) \otimes b_i$. Then $q^k$ and $\operatorname{vec}(Q^k)$ are governed by the following LTI model:

$$\begin{bmatrix} q^{k+1} \\ \operatorname{vec}(Q^{k+1}) \end{bmatrix} = \begin{bmatrix} H_{11} & 0 \\ H_{21} & H_{22} \end{bmatrix} \begin{bmatrix} q^k \\ \operatorname{vec}(Q^k) \end{bmatrix} + \begin{bmatrix} \alpha\big((P^T \operatorname{diag}(p_i^k)) \otimes I_{n_\xi}\big) b \\ \alpha^2\big((P^T \operatorname{diag}(p_i^k)) \otimes I_{n_\xi^2}\big) \hat B \end{bmatrix}, \qquad (20)$$

where $H_{11}$, $H_{21}$, and $H_{22}$ are given by

$$H_{11} = (P^T \otimes I_{n_\xi}) \operatorname{diag}(I_{n_\xi} + \alpha A_i), \qquad H_{22} = (P^T \otimes I_{n_\xi^2}) \operatorname{diag}\big((I_{n_\xi} + \alpha A_i) \otimes (I_{n_\xi} + \alpha A_i)\big),$$
$$H_{21} = \alpha \begin{bmatrix} p_{11} S(b_1, A_1) & \cdots & p_{n1} S(b_n, A_n) \\ \vdots & \ddots & \vdots \\ p_{1n} S(b_1, A_1) & \cdots & p_{nn} S(b_n, A_n) \end{bmatrix}. \qquad (21)$$

In addition, the following closed-form solution holds for any $k$:

$$q^k = (H_{11})^k q^0 + \alpha \sum_{t=0}^{k-1} (H_{11})^{k-1-t} \big((P^T \operatorname{diag}(p_i^t)) \otimes I_{n_\xi}\big) b,$$
$$\operatorname{vec}(Q^k) = (H_{22})^k \operatorname{vec}(Q^0) + \sum_{t=0}^{k-1} (H_{22})^{k-1-t} \Big( H_{21} q^t + \alpha^2 \big((P^T \operatorname{diag}(p_i^t)) \otimes I_{n_\xi^2}\big) \hat B \Big), \qquad (22)$$

where $H_{11}$, $H_{21}$, and $H_{22}$ are determined by (21).

Proof. A detailed proof is presented in the supplementary material. We present a proof sketch here. Notice (20) is a direct consequence of (7) and (8) (which are special cases of Proposition 3.35 in [14]). Specifically, it is straightforward to verify the following equations using the Markov assumption:

$$q_j^{k+1} = \sum_{i=1}^n p_{ij} \Big( (I + \alpha A_i) q_i^k + \alpha p_i^k b_i \Big), \qquad (23)$$
$$Q_j^{k+1} = \sum_{i=1}^n p_{ij} \Big( (I + \alpha A_i) Q_i^k (I + \alpha A_i)^T + 2\alpha \operatorname{sym}\big((I + \alpha A_i) q_i^k b_i^T\big) + \alpha^2 p_i^k b_i b_i^T \Big). \qquad (24)$$

Then we can apply the basic property of the vectorization operation to obtain (20). Applying (2) to iterate (20) directly leads to (22).

Therefore, the evolutions of $q^k$ and $Q^k$ can be fully understood via the well-established LTI system theory. Now we discuss the implications of Theorem 2 for TD learning.

Stability condition for TD learning. Similar to the IID case, the necessary and sufficient stability condition is $\sigma(H_{22}) < 1$. Now $H_{22}$ becomes a much larger matrix depending on the transition matrix $P$. An important question is how to choose $\alpha$ such that $\sigma(H_{22}) < 1$ for some given $\{A_i\}$, $\{b_i\}$, $P$, and $p^0$. Again, we perform an eigenvalue perturbation analysis for the matrix $H_{22}$.
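Before turning to this perturbation analysis, note that the LTI model of Theorem 2 is easy to check numerically. The sketch below (a tiny instance with made-up $A_i$, $b_i$, $P$, and $p^k$, using $d$ for the feature dimension $n_\xi$) builds $H_{11}$, $H_{21}$, $H_{22}$ from (21) and verifies that one step of the stacked model (20) reproduces the componentwise recursions (23)-(24):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny hypothetical instance: n = 2 modes, feature dimension d = 2.
n, d, alpha = 2, 2, 0.05
A = [rng.standard_normal((d, d)) for _ in range(n)]
b = [rng.standard_normal(d) for _ in range(n)]
P = np.array([[0.9, 0.1], [0.3, 0.7]])      # row-stochastic transition matrix
pk = np.array([0.6, 0.4])                   # current distribution of z_k
M = [np.eye(d) + alpha * Ai for Ai in A]    # per-mode matrices I + alpha*A_i

def blkdiag(blocks):
    out = np.zeros((sum(B.shape[0] for B in blocks), sum(B.shape[1] for B in blocks)))
    r = c = 0
    for B in blocks:
        out[r:r + B.shape[0], c:c + B.shape[1]] = B
        r, c = r + B.shape[0], c + B.shape[1]
    return out

# The matrices of (21) and the input terms of (20).
H11 = np.kron(P.T, np.eye(d)) @ blkdiag(M)
H22 = np.kron(P.T, np.eye(d * d)) @ blkdiag([np.kron(Mi, Mi) for Mi in M])
S = [np.kron(b[i][:, None], M[i]) + np.kron(M[i], b[i][:, None]) for i in range(n)]
H21 = alpha * np.block([[P[i, j] * S[i] for i in range(n)] for j in range(n)])
u1 = alpha * np.kron(P.T @ np.diag(pk), np.eye(d)) @ np.concatenate(b)
u2 = alpha**2 * np.kron(P.T @ np.diag(pk), np.eye(d * d)) @ np.concatenate(
    [np.kron(bi, bi) for bi in b])

# Arbitrary current augmented state: q^k = (q_i), Q^k = (Q_i) symmetric.
q = [rng.standard_normal(d) for _ in range(n)]
Q = [np.eye(d) + 0.1 * (lambda X: (X + X.T) / 2)(rng.standard_normal((d, d))) for _ in range(n)]
vec = lambda X: X.flatten(order="F")        # column-stacking vectorization
sym = lambda X: (X + X.T) / 2

# One step of the stacked LTI model (20)...
q_lti = H11 @ np.concatenate(q) + u1
Q_lti = H21 @ np.concatenate(q) + H22 @ np.concatenate([vec(Qi) for Qi in Q]) + u2

# ...must agree with the componentwise recursions (23)-(24).
q_dir = [sum(P[i, j] * (M[i] @ q[i] + alpha * pk[i] * b[i]) for i in range(n))
         for j in range(n)]
Q_dir = [sum(P[i, j] * (M[i] @ Q[i] @ M[i].T
                        + 2 * alpha * sym(M[i] @ np.outer(q[i], b[i]))
                        + alpha**2 * pk[i] * np.outer(b[i], b[i])) for i in range(n))
         for j in range(n)]

print(np.allclose(q_lti, np.concatenate(q_dir)))                      # True
print(np.allclose(Q_lti, np.concatenate([vec(Qj) for Qj in Q_dir])))  # True
```

The agreement rests on the standard vectorization identities $\operatorname{vec}(MXM^T) = (M \otimes M)\operatorname{vec}(X)$ and $\operatorname{vec}(uv^T) = v \otimes u$, which is exactly how (23)-(24) are stacked into (20).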
This case is quite subtle due to the fact that we are no longer perturbing an identity matrix. We are perturbing the matrix $P^T \otimes I_{n_\xi^2}$, and the eigenvalues here are not simple. Under the ergodicity assumption, the largest eigenvalue of $P^T \otimes I_{n_\xi^2}$ (which is 1) is semisimple. Hence we can directly apply the results in Section II of [35] or Theorem 2.1 in [42] to show

$$\lambda_{\max}(H_{22}) = 1 + 2\operatorname{Re}(\lambda_{\max}(\bar A))\alpha + o(\alpha), \qquad (25)$$

where $\bar A = \sum_{i=1}^n p_i^\infty A_i$ and $p^\infty$ is the unique stationary distribution of the Markov chain under the ergodicity assumption. Then we still have $\sigma(H_{22}) = 1 + 2\operatorname{Re}(\lambda_{\max}(\bar A))\alpha + o(\alpha)$. Therefore, as long as $\bar A$ is Hurwitz, there exists sufficiently small $\alpha$ such that $\sigma(H_{22}) < 1$. This is consistent with Assumption 3 in [45]. To understand the details of our perturbation argument, we refer the readers to the remark placed after Theorem 2.1 in [42]. Notice we have

$$H_{22} = P^T \otimes I_{n_\xi^2} + \alpha (P^T \otimes I_{n_\xi^2}) \operatorname{diag}(A_i \otimes I + I \otimes A_i) + O(\alpha^2).$$

The largest eigenvalue of $P^T \otimes I_{n_\xi^2}$ is semisimple due to the ergodicity assumption. Then the perturbation result directly follows as a consequence of Theorem 2.1 in [42]. More explanations are also provided in the supplementary material.

Exact limit for the mean square TD error and related convergence rate. Assume the Markov chain $\{z_k\}$ is aperiodic and irreducible. Then we have $p^t \to p^\infty$ at some linear rate, where $p^\infty$ is the stationary distribution. In this case, we can apply Proposition 3 to show that the mean square error of TD learning converges linearly to an exact limit.

Corollary 2. Consider the TD update (12) with $\bar A$ being Hurwitz. Let $\{z_k\}$ be a Markov chain sampled from $\mathcal{N}$ using the transition matrix $P$. Suppose $\sigma(H_{22}) < 1$. We set $N = n n_\xi^2$.
If we assume $p^k \to p^\infty$, where $p^\infty$ is the stationary distribution for $\{z_k\}$, then we have

$$q^\infty = \lim_{k\to\infty} q^k = \alpha (I - H_{11})^{-1} \big((P^T \operatorname{diag}(p_i^\infty)) \otimes I_{n_\xi}\big) b,$$
$$\operatorname{vec}(Q^\infty) = \lim_{k\to\infty} \operatorname{vec}(Q^k) = \alpha^2 (I_N - H_{22})^{-1} \Big( \alpha^{-2} H_{21} q^\infty + \big((P^T \operatorname{diag}(p_i^\infty)) \otimes I_{n_\xi^2}\big) \hat B \Big),$$
$$\delta_\infty = \lim_{k\to\infty} \mathbb{E}\|\theta_k - \theta^*\|^2 = (\mathbf{1}_n^T \otimes \operatorname{vec}(I_{n_\xi})^T) \operatorname{vec}(Q^\infty). \qquad (26)$$

If we further assume geometric ergodicity, i.e. $\|p^k - p^\infty\| \le C \tilde\rho^k$, then we have

$$\delta_\infty - C_0 \max\{\sigma(H) + \varepsilon, \tilde\rho\}^k \le \mathbb{E}\|\theta_k - \theta^*\|^2 \le \delta_\infty + C_0 \max\{\sigma(H) + \varepsilon, \tilde\rho\}^k, \qquad (27)$$

where $C_0$ is some constant and $\varepsilon$ is an arbitrarily small positive number. For sufficiently small $\alpha$, we have $\delta_\infty = O(\alpha)$. If $\lambda_{\max}\big(P^T \otimes I_{n_\xi^2} + \alpha (P^T \otimes I_{n_\xi^2}) \operatorname{diag}(A_i \otimes I + I \otimes A_i)\big)$ is a semisimple eigenvalue, then we further have $\sigma(H) = 1 + \operatorname{Re}(\lambda_{\max}(\bar A))\alpha + o(\alpha)$ for sufficiently small $\alpha$.

Proof. Notice $\mathbb{E}\|\theta_k - \theta^*\|^2 = (\mathbf{1}_n^T \otimes \operatorname{vec}(I_{n_\xi})^T)\operatorname{vec}(Q^k)$. We can directly apply Theorem 2, Proposition 1, and Proposition 3 to prove (26) and (27). When $\alpha$ is small, we can apply the Laurent series trick in [2, 24] to show that $\lim_{k\to\infty} \operatorname{vec}(Q^k) = O(\alpha)$ and $\delta_\infty = O(\alpha)$. The difficulty here is that $I_N - P^T \otimes I_{n_\xi^2}$ is a singular matrix, and hence $(I_N - H_{22})^{-1}$ does not have a Taylor series around $\alpha = 0$. Therefore, we need to apply a more advanced matrix inverse perturbation result to perform a Laurent expansion of $(I_N - H_{22})^{-1}$. By using the ergodicity assumption and the matrix inverse perturbation theory in [2, 24], we can obtain the Laurent expansion of $(I_N - H_{22})^{-1}$ and show $\lim_{k\to\infty} \operatorname{vec}(Q^k) = O(\alpha)$. Consequently, we have $\delta_\infty = O(\alpha)$. By applying Theorem 2.1 in [42], we can show $\sigma(H) = 1 + \operatorname{Re}(\lambda_{\max}(\bar A))\alpha + o(\alpha)$.

Due to the assumption $\sum_{i=1}^n p_i^\infty b_i = 0$, we have $\lim_{k\to\infty} q^k \ne 0$ in general but $\mu^\infty = 0$. Again, we have obtained both upper and lower bounds for the mean square TD error. Our result states that under mild technical assumptions, the final TD error is actually exactly on the order of $O(\alpha)$. This justifies the tightness of the existing upper bounds for the final TD error [6, 45] up to a constant factor. From the above corollary, we can also see the trade-off between the convergence rate and the steady-state error. Clearly, the convergence rate in (27) also depends on the initial distribution $p^0$ and the mixing rate of the underlying Markov jump parameter $\{z_k\}$ (which is denoted as $\tilde\rho$). If the initial distribution is the stationary distribution, i.e. $p^0 = p^\infty$, the input to the LTI dynamical system (20) is just a constant for all $k$, and then we will be able to obtain an exact formula similar to (16). However, for a general initial distribution $p^0$, the mixing rate $\tilde\rho$ matters more and may affect the overall convergence rate. One resultant guideline for algorithm design is that increasing $\alpha$ may not increase the convergence rate when the mixing rate $\tilde\rho$ dominates the convergence process. When $\alpha$ becomes smaller and smaller, eventually $\sigma(H)$ becomes the dominating term and the mixing rate no longer affects the convergence rate.
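To make this guideline concrete, here is a toy numerical sketch (the transition matrix and $\bar A$ are made up) that compares the first-order approximation $\sigma(H_{22}) \approx 1 + 2\operatorname{Re}(\lambda_{\max}(\bar A))\alpha$ against a hypothetical mixing rate $\tilde\rho$, taken as the second-largest eigenvalue modulus of $P$:

```python
import numpy as np

# Hypothetical slowly mixing chain and Hurwitz "average" matrix Abar.
P = np.array([[0.95, 0.05], [0.05, 0.95]])
rho = sorted(abs(np.linalg.eigvals(P)))[-2]  # mixing rate: second-largest |eigenvalue| (0.9 here)
Abar = np.array([[-1.0, 0.2], [0.0, -2.0]])
lam = max(np.linalg.eigvals(Abar).real)      # Re(lambda_max(Abar)) = -1

for alpha in (0.1, 0.05, 0.02, 0.01):
    sigma = 1 + 2 * lam * alpha              # first-order approximation of sigma(H22)
    overall = max(sigma, rho)                # overall linear rate per (27)
    print(alpha, round(sigma, 3), round(overall, 3))
```

For $\alpha \ge 0.05$ the overall rate is stuck at $\tilde\rho = 0.9$ (mixing-dominated), so further increasing $\alpha$ buys no speedup while inflating $\delta_\infty = O(\alpha)$; for smaller $\alpha$, the $\sigma(H)$ term takes over.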
Similar to the IID case, for sufficiently small $\alpha$, it seems possible to obtain alternative upper bounds of the form $\mathbb{E}\|\theta_k - \theta^*\|^2 \le \delta_\infty + O(\alpha) + C_0(\sigma(H_{22}) + \varepsilon)^k$, where $\sigma(H_{22}) \approx 1 + 2\operatorname{Re}(\lambda_{\max}(\bar A))\alpha$. Such modifications are not pursued in this paper.

Algorithm design. Here we make a remark on how our proposed MJLS framework can be further extended to provide clues for designing fast TD learning. When $\alpha$ (or even other hyperparameters, including a momentum term) changes with time, we can still obtain expressions for $\operatorname{vec}(Q^k)$ and $q^k$ in an iterative form. However, both $H$ and $G$ depend on $k$ now. Then, given a fixed time budget $T$, it is in theory possible to minimize the mean square estimation error at $T$ subject to some optimization constraints in the form of a time-varying iteration

$$\begin{bmatrix} q^{k+1} \\ \operatorname{vec}(Q^{k+1}) \end{bmatrix} = H(k) \begin{bmatrix} q^k \\ \operatorname{vec}(Q^k) \end{bmatrix} + G(k) u^k.$$

One may use this control-oriented optimization formulation to gain theoretical insights on how to choose hyperparameters adaptively for fast TD learning. Clearly, solving such an optimization problem requires knowing the underlying Markov model. However, this type of theoretical study may lead to new hyperparameter tuning heuristics that do not require the model information.

References

[1] H. Abou-Kandil, G. Freiling, and G. Jank. On the solution of discrete-time Markovian jump linear quadratic control problems. Automatica, 31(5):765–768, 1995.

[2] K. Avrachenkov and J. Lasserre. Analytic perturbation of generalized inverses. Linear Algebra and its Applications, 438(4):1793–1813, 2013.

[3] N. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. Robust accelerated gradient method. arXiv preprint arXiv:1805.10579, 2018.

[4] N. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar.
A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022, 2019.

[5] D. Bertsekas and J. Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, 1996.

[6] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

[7] S. Bhatnagar, H. Prasad, and L. Prashanth. Stochastic recursive algorithms for optimization: simultaneous perturbation methods, volume 434. Springer, 2012.

[8] V. Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.

[9] V. Borkar and S. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

[10] C. Chen. Linear system theory and design. Oxford University Press, Inc., 1998.

[11] Z. Chen, S. Zhang, T. Doan, S. Maguluri, and J. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, 2019.

[12] H. Chizeck, A. Willsky, and D. Castanon. Discrete-time Markovian-jump linear quadratic optimal control. International Journal of Control, 43(1):213–231, 1986.

[13] O. Costa and M. Fragoso. Stability results for discrete-time linear systems with Markovian jumping parameters. Journal of Mathematical Analysis and Applications, 179(1):154–178, 1993.

[14] O. Costa, M. Fragoso, and R. Marques. Discrete-time Markov jump linear systems. Springer Science & Business Media, 2006.

[15] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard. A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference (ACC), pages 1376–1381, 2018.

[16] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[17] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.

[18] N. Dhingra, S. Khong, and M. Jovanovic. The proximal augmented Lagrangian method for nonsmooth composite optimization. IEEE Transactions on Automatic Control, 2018.

[19] L. El Ghaoui and M. Rami. Robust state-feedback stabilization of jump linear systems via LMIs. International Journal of Robust and Nonlinear Control, 6(9-10):1015–1022, 1996.

[20] Y. Fang and K. Loparo. Stochastic stability of jump linear systems. IEEE Transactions on Automatic Control, 47(7):1204–1208, 2002.

[21] M. Fazlyab, A. Ribeiro, M. Morari, and V. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. arXiv preprint arXiv:1705.03615, 2017.

[22] M. Fazlyab, A. Ribeiro, M. Morari, and V. Preciado. A dynamical systems perspective to convergence rate analysis of proximal algorithms. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 354–360, 2017.

[23] X. Feng, K. Loparo, Y. Ji, and H. Chizeck. Stochastic stability properties of jump linear systems. IEEE Transactions on Automatic Control, 37(1):38–53, 1992.

[24] P. Gonzalez-Rodriguez, M. Moscoso, and M. Kindelan. Laurent expansion of the inverse of perturbed, singular matrices. Journal of Computational Physics, 299:307–319, 2015.

[25] H. Gupta, R. Srikant, and L. Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In Advances in Neural Information Processing Systems, 2019.

[26] S. Han. Systematic design of decentralized algorithms for consensus optimization. arXiv preprint arXiv:1903.01023, 2019.

[27] J. Hespanha. Linear systems theory. Princeton University Press, 2009.

[28] B. Hu and L. Lessard. Control interpretations for first-order optimization methods. In 2017 American Control Conference (ACC), pages 3114–3119, 2017.

[29] B. Hu and L. Lessard. Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[30] B. Hu, P. Seiler, and L. Lessard. Analysis of approximate stochastic gradient using quadratic constraints and sequential semidefinite programs. arXiv preprint arXiv:1711.00987, 2017.

[31] B. Hu, P. Seiler, and A. Rantzer. A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints. In Conference on Learning Theory, pages 1157–1189, 2017.

[32] B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of SVRG and Katyusha using semidefinite programs. In International Conference on Machine Learning, pages 2043–2052, 2018.

[33] Y. Ji and H. Chizeck. Controllability, observability and discrete-time Markovian jump linear quadratic control. International Journal of Control, 48(2):481–498, 1988.

[34] Y. Ji and H. Chizeck. Jump linear quadratic Gaussian control: Steady-state solution and testable conditions. Control Theory and Advanced Technology, 6:289–319, 1990.

[35] T. Kato. Perturbation theory for linear operators, volume 132. Springer Science & Business Media, 2013.

[36] H. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

[37] C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.

[38] D. Lee and N. He. Target-based temporal difference learning. arXiv preprint arXiv:1904.10945, 2019.

[39] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[40] L. Lessard and P. Seiler. Direct synthesis of iterative algorithms with bounds on achievable worst-case convergence rate. arXiv preprint arXiv:1904.09046, 2019.

[41] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In UAI, pages 504–513, 2015.

[42] J. Moro, J. Burke, and M. Overton. On the Lidskii–Vishik–Lyusternik perturbation theory for eigenvalues of matrices with arbitrary Jordan structure. SIAM Journal on Matrix Analysis and Applications, 18(4):793–817, 1997.

[43] Z. Nelson and E. Mallada. An integral quadratic constraint framework for real-time steady-state optimization of linear time-invariant systems. In 2018 Annual American Control Conference (ACC), pages 597–603, 2018.

[44] P. Seiler and R. Sengupta. A bounded real lemma for jump systems. IEEE Transactions on Automatic Control, 48(9):1651–1654, 2003.

[45] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923, 2019.

[46] A. Sundararajan, B. Hu, and L. Lessard. Robust convergence analysis of distributed optimization algorithms. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1206–1212, 2017.

[47] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[48] R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[49] R. Sutton, H. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000, 2009.

[50] R. Sutton, C. Szepesvári, and H. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Proceedings of the 21st International Conference on Neural Information Processing Systems, pages 1609–1616, 2008.

[51] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

[52] B. Van Scoy, R. Freeman, and K. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2017.

[53] G. Wang, B. Li, and G. Giannakis. A multistep Lyapunov approach for finite-time analysis of biased stochastic approximation. arXiv preprint arXiv:1909.04299, 2019.

[54] T. Xu, S. Zou, and Y. Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Advances in Neural Information Processing Systems, 2019.