{"title": "Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator", "book": "Advances in Neural Information Processing Systems", "page_first": 8514, "page_last": 8524, "abstract": "We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\\log(1/\\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \\varepsilon^{-2} \\log(1/\\varepsilon)$. We furthermore build on our analysis and construct a simple adaptive procedure based on $\\varepsilon$-greedy exploration which relies on approximate PI as a sub-routine and obtains $T^{2/3}$ regret, improving upon a recent result of Abbasi-Yadkori et al. 2019.", "full_text": "Finite-time Analysis of Approximate Policy Iteration\n\nfor the Linear Quadratic Regulator\n\nKarl Krauth\n\nUniversity of California, Berkeley\n\nkarlk@berkeley.edu\n\nStephen Tu\n\nUniversity of California, Berkeley\n\nstephentu@berkeley.edu\n\nBenjamin Recht\n\nUniversity of California, Berkeley\n\nbrecht@berkeley.edu\n\nAbstract\n\nWe study the sample complexity of approximate policy iteration (PI) for the\nLinear Quadratic Regulator (LQR), building on a recent line of work using LQR\nas a testbed to understand the limits of reinforcement learning (RL) algorithms\non continuous control tasks. 
Our analysis quanti\ufb01es the tension between policy\nimprovement and policy evaluation, and suggests that policy evaluation is the\ndominant factor in terms of sample complexity. Speci\ufb01cally, we show that to obtain\na controller that is within \u03b5 of the optimal LQR controller, each step of policy\nevaluation requires at most (n + d)3/\u03b52 samples, where n is the dimension of\nthe state vector and d is the dimension of the input vector. On the other hand,\nonly log(1/\u03b5) policy improvement steps suf\ufb01ce, resulting in an overall sample\ncomplexity of (n + d)3\u03b5\u22122 log(1/\u03b5). We furthermore build on our analysis and\nconstruct a simple adaptive procedure based on \u03b5-greedy exploration which relies\non approximate PI as a sub-routine and obtains T 2/3 regret, improving upon a\nrecent result of Abbasi-Yadkori et al. [3].\n\n1\n\nIntroduction\n\nWith the recent successes of reinforcement learning (RL) on continuous control tasks, there has been\na renewed interest in understanding the sample complexity of RL methods. A recent line of work\nhas focused on the Linear Quadratic Regulator (LQR) as a testbed to understand the behavior and\ntrade-offs of various RL algorithms in the continuous state and action space setting. These results\ncan be broadly grouped into two categories: (1) the study of model-based methods which use data to\nbuild an estimate of the transition dynamics, and (2) model-free methods which directly estimate the\noptimal feedback controller from data without building a dynamics model as an intermediate step.\nMuch of the recent progress in LQR has focused on the model-based side, with an analysis of robust\ncontrol from Dean et al. [12] and certainty equivalence control by Fiechter [17] and Mania et al.\n[26]. These techniques have also been extended to the online, adaptive setting [1, 4, 11, 13, 31]. 
On the other hand, for classic model-free RL algorithms such as Q-learning, SARSA, and approximate policy iteration (PI), our understanding is much less complete within the context of LQR. This is despite the fact that these algorithms are well understood in the tabular (finite state and action space) setting. Indeed, most of the model-free analysis for LQR [16, 24, 35] has focused exclusively on derivative-free random search methods.
In this paper, we extend our understanding of model-free algorithms for LQR by studying the performance of approximate PI on LQR, which is a classic approximate dynamic programming algorithm. Approximate PI is a model-free algorithm which iteratively uses trajectory data to estimate the state-value function associated to the current policy (via e.g. temporal difference learning), and then uses this estimate to greedily improve the policy. A key issue in analyzing approximate PI is to understand the trade-off between the number of policy improvement iterations, and the amount of data to collect for each policy evaluation phase. Our analysis quantifies this trade-off, showing that if least-squares temporal difference learning (LSTD-Q) [9, 20] is used for policy evaluation, then a trajectory of length $\widetilde{O}((n + d)^3/\varepsilon^2)$ for each inner step of policy evaluation combined with $O(\log(1/\varepsilon))$ outer steps of policy improvement suffices to learn a controller that has $\varepsilon$-error from the optimal controller. This yields an overall sample complexity of $\widetilde{O}((n + d)^3 \varepsilon^{-2} \log(1/\varepsilon))$. Prior to our work, the only known guarantee for approximate PI on LQR was the asymptotic consistency result of Bradtke [10] in the setting of no process noise.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We also extend our analysis of approximate PI to the online, adaptive LQR setting popularized by Abbasi-Yadkori and Szepesvári [1].
By using a greedy exploration scheme similar to Dean et al. [13] and Mania et al. [26], we prove a $\widetilde{O}(T^{2/3})$ regret bound for a simple adaptive policy improvement algorithm. While the $T^{2/3}$ rate is sub-optimal compared to the $T^{1/2}$ regret from model-based methods [1, 11, 26], our analysis improves the $\widetilde{O}(T^{2/3+\varepsilon})$ regret (for $T \geq C^{1/\varepsilon}$) from the model-free Follow the Leader (FTL) algorithm of Abbasi-Yadkori et al. [3]. To the best of our knowledge, we give the best regret guarantee known for a model-free algorithm. We leave open the question of whether or not a model-free algorithm can achieve optimal $T^{1/2}$ regret.

2 Main Results

In this paper, we consider the following linear dynamical system:

$x_{t+1} = A x_t + B u_t + w_t \,, \quad w_t \sim \mathcal{N}(0, \sigma_w^2 I) \,, \quad x_0 \sim \mathcal{N}(0, \Sigma_0) \,.$  (2.1)

We let n denote the dimension of the state $x_t$ and d denote the dimension of the input $u_t$. For simplicity we assume that $d \leq n$, i.e. the system is under-actuated. We fix two positive definite cost matrices (S, R), and consider the infinite horizon average-cost Linear Quadratic Regulator (LQR):

$J_\star := \min_{\{u_t(\cdot)\}} \lim_{T \to \infty} \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} x_t^\mathsf{T} S x_t + u_t^\mathsf{T} R u_t \right]$ subject to (2.1).  (2.2)

We assume the dynamics matrices (A, B) are unknown to us, and our method of interaction with (2.1) is to choose an input sequence $\{u_t\}$ and observe the resulting states $\{x_t\}$.
We study the solution to (2.2) using least-squares policy iteration (LSPI), a well-known approximate dynamic programming method in RL introduced by Lagoudakis and Parr [20]. The study of approximate PI on LQR dates back to the Ph.D. thesis of Bradtke [10], where he showed that for noiseless LQR (when $w_t = 0$ for all t), the approximate PI algorithm is asymptotically consistent. In this paper we expand on this result and quantify non-asymptotic rates for approximate PI on LQR.
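To make the interaction model concrete, here is a minimal simulation sketch of the dynamics (2.1) together with a Monte Carlo estimate of the average cost (2.2) for a fixed linear policy $u_t = K x_t$. The instance below is an arbitrary stable toy system chosen for illustration; it is not one of the instances used in the paper.

```python
import numpy as np

def average_cost(A, B, S, R, K, sigma_w=1.0, T=20000, seed=0):
    """Roll out x_{t+1} = A x_t + B u_t + w_t under u_t = K x_t and return
    the empirical average cost (1/T) * sum_t (x_t' S x_t + u_t' R u_t)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = np.zeros(n)
    total = 0.0
    for _ in range(T):
        u = K @ x
        total += x @ S @ x + u @ R @ u
        x = A @ x + B @ u + sigma_w * rng.standard_normal(n)
    return total / T

# Arbitrary stable toy instance (illustrative only, not from the paper).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
K = np.zeros((1, 2))  # u_t = 0 is stabilizing here since A is already stable
J_hat = average_cost(A, B, S, R, K)
```

For a stabilizing K the empirical average converges to the steady-state cost $\langle S + K^\mathsf{T} R K, P_\infty \rangle$, which is the quantity the limit in (2.2) computes.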
Proofs of all results can be found in the extended version of this paper [19].
Notation. For a positive scalar $x > 0$, we let $x_+ = \max\{1, x\}$. A square matrix L is called stable if $\rho(L) < 1$ where $\rho(\cdot)$ denotes the spectral radius of L. For a symmetric matrix $M \in \mathbb{R}^{n \times n}$, we let $\mathrm{dlyap}(L, M)$ denote the unique solution to the discrete Lyapunov equation $P = L^\mathsf{T} P L + M$. We also let $\mathrm{svec}(M) \in \mathbb{R}^{n(n+1)/2}$ denote the vectorized version of the upper triangular part of M so that $\|M\|_F^2 = \langle \mathrm{svec}(M), \mathrm{svec}(M) \rangle$. Finally, $\mathrm{smat}(\cdot)$ denotes the inverse of $\mathrm{svec}(\cdot)$, so that $\mathrm{smat}(\mathrm{svec}(M)) = M$.

2.1 Least-Squares Temporal Difference Learning (LSTD-Q)

The first component towards an understanding of approximate PI is to understand least-squares temporal difference learning (LSTD-Q) for Q-functions, which is the fundamental building block of LSPI. Given a deterministic policy $K_{\mathrm{eval}}$ which stabilizes (A, B), the goal of LSTD-Q is to estimate the parameters of the Q-function associated to $K_{\mathrm{eval}}$. Bellman's equation for infinite-horizon average cost MDPs (c.f. Bertsekas [6]) states that the (relative) Q-function associated to a policy $\pi$ satisfies the following fixed-point equation:

$\lambda + Q(x, u) = c(x, u) + \mathbb{E}_{x' \sim p(\cdot|x,u)}[Q(x', \pi(x'))] \,.$  (2.3)

Here, $\lambda \in \mathbb{R}$ is a free parameter chosen so that the fixed-point equation holds. LSTD-Q operates under the linear architecture assumption, which states that the Q-function can be described as $Q(x, u) = q^\mathsf{T} \phi(x, u)$, for a known (possibly non-linear) feature map $\phi(x, u)$.
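The svec/smat and dlyap notation above is straightforward to make concrete. A minimal sketch: the $\sqrt{2}$ scaling of off-diagonal entries is exactly what makes the Frobenius inner-product identity hold, and scipy's discrete Lyapunov solver is called with a transpose to match the convention $P = L^\mathsf{T} P L + M$ used here.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def svec(M):
    """Vectorize the upper triangle of symmetric M, scaling off-diagonal
    entries by sqrt(2) so that <svec(M), svec(N)> = <M, N>_F."""
    i, j = np.triu_indices(M.shape[0])
    scale = np.where(i == j, 1.0, np.sqrt(2.0))
    return scale * M[i, j]

def smat(v):
    """Inverse of svec: rebuild the symmetric matrix from its svec."""
    n = int((np.sqrt(8 * v.size + 1) - 1) / 2)
    M = np.zeros((n, n))
    i, j = np.triu_indices(n)
    scale = np.where(i == j, 1.0, 1.0 / np.sqrt(2.0))
    M[i, j] = scale * v
    M[j, i] = M[i, j]
    return M

def dlyap(L, M):
    """Solve P = L' P L + M (note the transpose convention of the paper)."""
    return solve_discrete_lyapunov(L.T, M)
```

For example, `dlyap(A + B @ K, S + K.T @ R @ K)` computes the value-function matrix of a stabilizing policy K, as used repeatedly below.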
It is well known that LQR satisfies the linear architecture assumption, since we have:

$Q(x, u) = \mathrm{svec}(Q)^\mathsf{T} \mathrm{svec}\left( \begin{bmatrix} x \\ u \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix}^\mathsf{T} \right) \,, \quad Q = \begin{bmatrix} S & 0 \\ 0 & R \end{bmatrix} + \begin{bmatrix} A^\mathsf{T} \\ B^\mathsf{T} \end{bmatrix} V \begin{bmatrix} A & B \end{bmatrix} \,,$

$V = \mathrm{dlyap}(A + B K_{\mathrm{eval}}, S + K_{\mathrm{eval}}^\mathsf{T} R K_{\mathrm{eval}}) \,, \quad \lambda = \left\langle Q, \, \sigma_w^2 \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix} \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix}^\mathsf{T} \right\rangle \,.$

Here, we slightly abuse notation and let Q denote the Q-function and also the matrix parameterizing the Q-function. Now suppose that a trajectory $\{(x_t, u_t, x_{t+1})\}_{t=1}^T$ is collected. Note that LSTD-Q is an off-policy method (unlike the closely related LSTD estimator for value functions), and therefore the inputs $u_t$ can come from any sequence that provides sufficient excitation for learning. In particular, it does not have to come from the policy $K_{\mathrm{eval}}$. In this paper, we will consider inputs of the form:

$u_t = K_{\mathrm{play}} x_t + \eta_t \,, \quad \eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I) \,,$  (2.4)

where $K_{\mathrm{play}}$ is a stabilizing controller for (A, B). Once again we emphasize that $K_{\mathrm{play}} \neq K_{\mathrm{eval}}$ in general. Furthermore, the policy under $K_{\mathrm{eval}}$ is deterministic while the policy under $K_{\mathrm{play}}$ is stochastic, where the injected noise $\eta_t$ is needed in order to provide sufficient excitation for learning. In order to describe the LSTD-Q estimator, we define the following quantities which play a key role throughout the paper:

$\phi_t := \phi(x_t, u_t) \,, \quad \psi_t := \phi(x_t, K_{\mathrm{eval}} x_t) \,, \quad f := \sigma_w^2 \, \mathrm{svec}\left( \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix} \begin{bmatrix} I \\ K_{\mathrm{eval}} \end{bmatrix}^\mathsf{T} \right) \,, \quad c_t := x_t^\mathsf{T} S x_t + u_t^\mathsf{T} R u_t \,.$

The LSTD-Q estimator estimates q via:

$\widehat{q} := \left( \sum_{t=1}^{T} \phi_t (\phi_t - \psi_{t+1} + f)^\mathsf{T} \right)^{\dagger} \sum_{t=1}^{T} \phi_t c_t \,.$  (2.5)

Here, $(\cdot)^{\dagger}$ denotes the Moore-Penrose pseudo-inverse. Our first result establishes a non-asymptotic bound on the quality of the estimator $\widehat{q}$, measured in terms of $\|\widehat{q} - q\|$.
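The estimator (2.5) transcribes directly into code. The sketch below is illustrative, not the authors' implementation; as a sanity check, in the noiseless case $w_t = 0$ we have $f = 0$ and LSTD-Q recovers $q$ exactly from a sufficiently exciting trajectory, mirroring Bradtke's consistency result mentioned above.

```python
import numpy as np

def svec(M):
    """Upper-triangular vectorization with sqrt(2)-scaled off-diagonals."""
    i, j = np.triu_indices(M.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * M[i, j]

def phi(x, u):
    """Quadratic feature map phi(x, u) = svec([x; u][x; u]')."""
    z = np.concatenate([x, u])
    return svec(np.outer(z, z))

def lstdq(xs, us, S, R, K_eval, sigma_w):
    """LSTD-Q estimate (2.5) of the Q-function parameters of K_eval,
    computed from an off-policy trajectory xs[0..T], us[0..T-1]."""
    n = xs.shape[1]
    Z = np.vstack([np.eye(n), K_eval])            # [I; K_eval]
    f = sigma_w ** 2 * svec(Z @ Z.T)
    Phi = np.array([phi(x, u) for x, u in zip(xs[:-1], us)])
    Psi_next = np.array([phi(x, K_eval @ x) for x in xs[1:]])
    c = np.array([x @ S @ x + u @ R @ u for x, u in zip(xs[:-1], us)])
    return np.linalg.pinv(Phi.T @ (Phi - Psi_next + f)) @ (Phi.T @ c)
```

Here `xs` has one more row than `us` (it includes the terminal state), and the LSTDQ routine used inside Algorithms 1 and 2 corresponds to this estimator followed by smat and the projection $\mathrm{Proj}_\mu$.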
Before we state our result, we introduce a key definition that we will use extensively.
Definition 1. Let L be a square matrix. Let $\tau \geq 1$ and $\rho \in (0, 1)$. We say that L is $(\tau, \rho)$-stable if

$\|L^k\| \leq \tau \rho^k \,, \quad k = 0, 1, 2, \ldots \,.$

While stability of a matrix is an asymptotic notion, Definition 1 quantifies the degree of stability by characterizing the transient response of the powers of a matrix by the parameter $\tau$. It is closely related to the notion of strong stability from Cohen et al. [11].
With Definition 1 in place, we state our first result for LSTD-Q.
Theorem 2.1. Fix a $\delta \in (0, 1)$. Let policies $K_{\mathrm{play}}$ and $K_{\mathrm{eval}}$ stabilize (A, B), and assume that both $A + B K_{\mathrm{play}}$ and $A + B K_{\mathrm{eval}}$ are $(\tau, \rho)$-stable. Let the initial state $x_0 \sim \mathcal{N}(0, \Sigma_0)$ and consider the inputs $u_t = K_{\mathrm{play}} x_t + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I)$. For simplicity, assume that $\sigma_\eta \leq \sigma_w$.
Let $P_\infty$ denote the steady-state covariance of the trajectory $\{x_t\}$:

$P_\infty = \mathrm{dlyap}((A + B K_{\mathrm{play}})^\mathsf{T}, \sigma_w^2 I + \sigma_\eta^2 B B^\mathsf{T}) \,.$  (2.6)

Define the proxy variance $\sigma^2$ by:

$\sigma^2 := \tau^2 \rho^4 \|\Sigma_0\| + \|P_\infty\| + \sigma_\eta^2 \|B\|^2 \,.$  (2.7)

Suppose that T satisfies:

$T \geq \widetilde{O}(1) \max\left\{ (n+d)^2, \; \frac{\tau^4}{\rho^4 (1 - \rho^2)^2} \frac{(n+d)^4}{\sigma_\eta^4} \sigma_w^2 \sigma^2 \|K_{\mathrm{play}}\|_+^4 \|K_{\mathrm{eval}}\|_+^8 (\|A\|^4 + \|B\|^4)_+ \right\} \,.$  (2.8)

Then we have with probability at least $1 - \delta$,

$\|\widehat{q} - q\| \leq \widetilde{O}(1) \frac{(n+d)}{\sigma_\eta^2 \sqrt{T}} \frac{\tau^2}{\rho^2 (1 - \rho^2)} \sigma_w \sigma \|K_{\mathrm{play}}\|_+^2 \|K_{\mathrm{eval}}\|_+^4 (\|A\|^2 + \|B\|^2)_+ \|Q_{K_{\mathrm{eval}}}\|_F \,.$  (2.9)

Here the $\widetilde{O}(1)$ hides $\mathrm{polylog}(n, \tau, \|\Sigma_0\|, \|P_\infty\|, \|K_{\mathrm{play}}\|, T/\delta, 1/\sigma_\eta)$ factors.
Theorem 2.1 states that:

$T \leq \widetilde{O}\left( \frac{1}{\sigma_\eta^4} \max\left\{ (n+d)^4, \frac{(n+d)^3}{\varepsilon^2} \right\} \right)$

timesteps are sufficient to achieve error $\|\widehat{q} - q\| \leq \varepsilon$ w.h.p. Several remarks are in order. First, while the $(n+d)^4$ burn-in is likely sub-optimal, the $(n+d)^3/\varepsilon^2$ dependence is sharp as shown by the asymptotic results of Tu and Recht [35]. Second, the $1/\sigma_\eta^4$ dependence on the injected excitation noise will be important when we study the online, adaptive setting in Section 2.3. We leave improving the polynomial dependence of the burn-in period to future work.
The proof of Theorem 2.1 rests on top of several recent advances. First, we build off the work of Abbasi-Yadkori et al.
[3] to derive a new basic inequality for LSTD-Q which serves as a starting point for the analysis. Next, we combine the small-ball techniques of Simchowitz et al. [33] with the self-normalized martingale inequalities of Abbasi-Yadkori et al. [2]. While an analysis of LSTD-Q is presented in Abbasi-Yadkori et al. [3] (which builds on the analysis for LSTD from Tu and Recht [34]), a direct application of their result yields a $1/\sigma_\eta^8$ dependence; the use of self-normalized inequalities is necessary in order to reduce this dependence to $1/\sigma_\eta^4$.

2.2 Least-Squares Policy Iteration (LSPI)

With Theorem 2.1 in place, we are ready to present the main results for LSPI. We describe two versions of LSPI in Algorithm 1 and Algorithm 2.

Algorithm 1 LSPIv1 for LQR
Input: $K_0$: initial stabilizing controller, N: number of policy iterations, T: length of rollout, $\sigma_\eta^2$: exploration variance, $\mu$: lower eigenvalue bound.
1: Collect $D = \{(x_k, u_k, x_{k+1})\}_{k=1}^T$ with input $u_k = K_0 x_k + \eta_k$, $\eta_k \sim \mathcal{N}(0, \sigma_\eta^2 I)$.
2: for t = 0, ..., N − 1 do
3:   $\widehat{Q}_t = \mathrm{Proj}_\mu(\mathrm{LSTDQ}(D, K_t))$.
4:   $K_{t+1} = G(\widehat{Q}_t)$. [See (2.10).]
5: end for
6: return $K_N$.

Algorithm 2 LSPIv2 for LQR
Input: $K_0$: initial stabilizing controller, N: number of policy iterations, T: length of rollout, $\sigma_\eta^2$: exploration variance, $\mu$: lower eigenvalue bound.
1: for t = 0, ..., N − 1 do
2:   Collect $D_t = \{(x_k^{(t)}, u_k^{(t)}, x_{k+1}^{(t)})\}_{k=1}^T$ with input $u_k^{(t)} = K_0 x_k^{(t)} + \eta_k^{(t)}$, $\eta_k^{(t)} \sim \mathcal{N}(0, \sigma_\eta^2 I)$.
3:   $\widehat{Q}_t = \mathrm{Proj}_\mu(\mathrm{LSTDQ}(D_t, K_t))$.
4:   $K_{t+1} = G(\widehat{Q}_t)$.
5: end for
6: return $K_N$.

In Algorithms 1 and 2, $\mathrm{Proj}_\mu(\cdot) = \arg\min_{X = X^\mathsf{T} : X \succeq \mu \cdot I} \|X - \cdot\|_F$ is the Euclidean projection onto the set of symmetric matrices lower bounded by $\mu \cdot I$.
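The two maps appearing in the listings above have simple closed forms. $\mathrm{Proj}_\mu$ is eigenvalue clipping (a standard fact about Frobenius projections onto this set, assumed here rather than stated in the paper), and the greedy map $G$ of (2.10) applied to the exact LQR Q-function gives the classical policy-iteration update. The sketch below checks exact PI (evaluation via dlyap, improvement via $G$) against the Riccati solution on an arbitrary toy instance:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def proj_mu(X, mu):
    """Frobenius projection onto {X = X' : X >= mu*I}: symmetrize,
    then clip the eigenvalues from below at mu."""
    w, U = np.linalg.eigh(0.5 * (X + X.T))
    return (U * np.maximum(w, mu)) @ U.T

def G(Q, d):
    """Greedy improvement map (2.10): Q is (n+d)x(n+d), returns a d x n gain."""
    Q12, Q22 = Q[:-d, -d:], Q[-d:, -d:]
    return -np.linalg.solve(Q22, Q12.T)

def exact_policy_iteration(A, B, S, R, K0, iters=20):
    """Exact PI: V_t = dlyap(A + B K_t, S + K_t' R K_t), then K_{t+1} = G(Q_t)."""
    n, d = B.shape
    K = K0
    for _ in range(iters):
        Acl = A + B @ K
        V = solve_discrete_lyapunov(Acl.T, S + K.T @ R @ K)  # P = L'PL + M
        AB = np.hstack([A, B])
        Q = np.block([[S, np.zeros((n, d))],
                      [np.zeros((d, n)), R]]) + AB.T @ V @ AB
        K = G(Q, d)
    return K

A = np.array([[0.9, 0.2], [0.0, 0.8]])  # arbitrary stable toy instance
B = np.array([[0.0], [1.0]])
S, R = np.eye(2), np.eye(1)
K_pi = exact_policy_iteration(A, B, S, R, K0=np.zeros((1, 2)))
P = solve_discrete_are(A, B, S, R)
K_star = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```

Note that $G(Q) = -(R + B^\mathsf{T} V B)^{-1} B^\mathsf{T} V A$ when Q is the exact Q-function of the evaluated policy, which is the Riccati gain formula; this is the sense in which approximate PI tracks exact PI.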
Furthermore, the map $G(\cdot)$ takes an $(n+d) \times (n+d)$ positive definite matrix and returns a $d \times n$ matrix:

$G\left( \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{12}^\mathsf{T} & Q_{22} \end{bmatrix} \right) = -Q_{22}^{-1} Q_{12}^\mathsf{T} \,.$  (2.10)

Algorithm 1 corresponds to the version presented in Lagoudakis and Parr [20], where all the data D is collected up front and is re-used in every iteration of LSTD-Q. Algorithm 2 is the one we will analyze in this paper, where new data is collected for every iteration of LSTD-Q. The modification made in Algorithm 2 simplifies the analysis by allowing the controller $K_t$ to be independent of the data $D_t$ in LSTD-Q. We remark that this does not require the system to be reset after every iteration of LSTD-Q. We leave analyzing Algorithm 1 to future work.
Before we state our main finite-sample guarantee for Algorithm 2, we review the notion of a (relative) value-function. Similarly to (relative) Q-functions, the infinite horizon average-cost Bellman equation states that the (relative) value function V associated to a policy $\pi$ satisfies the fixed-point equation:

$\lambda + V(x) = c(x, \pi(x)) + \mathbb{E}_{x' \sim p(\cdot|x,\pi(x))}[V(x')] \,.$  (2.11)

For a stabilizing policy K, it is well known that for LQR the value function $V(x) = x^\mathsf{T} V x$ with

$V = \mathrm{dlyap}(A + BK, S + K^\mathsf{T} R K) \,, \quad \lambda = \langle \sigma_w^2 I, V \rangle \,.$

Once again as we did for Q-functions, we slightly abuse notation and let V denote the value function and the matrix that parameterizes the value function. Our main result for Algorithm 2 appears in the following theorem. For simplicity, we will assume that $\|S\| \geq 1$ and $\|R\| \geq 1$.
Theorem 2.2. Fix a $\delta \in (0, 1)$. Let the initial policy $K_0$ input to Algorithm 2 stabilize (A, B). Suppose the initial state $x_0 \sim \mathcal{N}(0, \Sigma_0)$ and that the excitation noise satisfies $\sigma_\eta \leq \sigma_w$.
Recall that\nthe steady-state covariance of the trajectory {xt} is\n\nP\u221e = dlyap((A + BK0)T, \u03c32\n\nwI + \u03c32\n\n\u03b7BBT) .\n\nLet V0 denote the value function associated to the initial policy K0, and V(cid:63) denote the value function\nassociated to the optimal policy K(cid:63) for the LQR problem (2.2). De\ufb01ne the variables \u00b5, L as:\n\n\u00b5 := min{\u03bbmin(S), \u03bbmin(R)} ,\nL := max{(cid:107)S(cid:107),(cid:107)R(cid:107)} + 2((cid:107)A(cid:107)2 + (cid:107)B(cid:107)2 + 1)(cid:107)V0(cid:107)+ .\n\nFix an \u03b5 > 0 that satis\ufb01es:\n\n(cid:18) L\n\n(cid:19)2\n\n(cid:26)\n\n\u03b5 \u2264 5\n\n\u00b5\n\nmin\n\n1,\n\n2 log((cid:107)V0(cid:107)/\u03bbmin(V(cid:63)))\n\n,\n\ne\n\n(cid:107)V(cid:63)(cid:107)2\n\nSuppose we run Algorithm 2 for N := N0 + 1 policy improvement iterations where\n\n(cid:24)\n\n8\u00b52 log((cid:107)V0(cid:107)/\u03bbmin(V(cid:63)))\n\n(cid:18) 2 log((cid:107)V0(cid:107)/\u03bbmin(V(cid:63)))\n\n(cid:19)(cid:25)\n\n\u03b5\n\nN0 :=\n\n(1 + L/\u00b5) log\n\n,\n\n(2.13)\n\n(cid:27)\n\n.\n\n(2.12)\n\nand we set the rollout length T to satisfy:\n\n(cid:26)\n\nT \u2265 (cid:101)O(1) max\n\n(n + d)2,\n\nL2\n\n(1 \u2212 \u00b5/L)2\nL4\n1\n\u03b52\n\n(1 \u2212 \u00b5/L)2\n\n(cid:19)17 (n + d)4\n(cid:18) L\n(cid:19)42 (n + d)3\n(cid:18) L\n\n\u03c34\n\u03b7\n\n\u00b5\n\n\u00b5\n\n\u03c34\n\u03b7\n\n\u03c32\nw((cid:107)\u03a30(cid:107) + (cid:107)P\u221e(cid:107) + \u03c32\n\n\u03b7(cid:107)B(cid:107)2),\n\n(cid:27)\n\u03b7(cid:107)B(cid:107)2)\n\n\u03c32\nw((cid:107)\u03a30(cid:107) + (cid:107)P\u221e(cid:107) + \u03c32\n\n.\n\n(2.14)\n\nThen with probability 1 \u2212 \u03b4, we have that each policy Kt for t = 1, ..., N stabilizes (A, B) and\nfurthermore:\n\n(cid:107)KN \u2212 K(cid:63)(cid:107) \u2264 \u03b5 .\n\nHere the (cid:101)O(1) hides polylog(n, \u03c4,(cid:107)\u03a30(cid:107),(cid:107)P\u221e(cid:107), L/\u00b5, T /\u03b4, N0, 1/\u03c3\u03b7) factors.\nTheorem 2.2 states roughly that T \u00b7 N \u2264 
$\widetilde{O}\left( \frac{(n+d)^3}{\varepsilon^2} \log(1/\varepsilon) \right)$ samples are sufficient for LSPI to recover a controller K that is within $\varepsilon$ of the optimal $K_\star$. That is, only $\log(1/\varepsilon)$ iterations of policy improvement are necessary, and furthermore more iterations of policy improvement do not necessarily help due to the inherent statistical noise of estimating the Q-function for every policy $K_t$. We note that the polynomial factor in $L/\mu$ is by no means optimal and was deliberately made quite conservative in order to simplify the presentation of the bound. A sharper bound can be recovered from our analysis techniques at the expense of a less concise expression.
It is worth taking a moment to compare Theorem 2.2 to classical results in the RL literature regarding approximate policy iteration. For example, a well known result (c.f. Theorem 7.1 of Lagoudakis and Parr [20]) states that if LSTD-Q is able to return Q-function estimates with $L_\infty$ error bounded by $\varepsilon$ at every iteration, then letting $\widehat{Q}_t$ denote the approximate Q-function at the t-th iteration of LSPI:

$\limsup_{t \to \infty} \|\widehat{Q}_t - Q_\star\|_\infty \leq \frac{2\gamma\varepsilon}{(1-\gamma)^2} \,.$  (2.15)

Here, $\gamma$ is the discount factor of the MDP. Theorem 2.2 is qualitatively similar to this result in that we show roughly that $\varepsilon$ error in the Q-function estimate translates to $\varepsilon$ error in the estimated policy. However, there are several fundamental differences. First, our analysis does not rely on discounting to show contraction of the Bellman operator. Instead, we use the $(\tau, \rho)$-stability of the closed loop system to achieve this effect. Second, our analysis does not rely on $L_\infty$ bounds on the estimated Q-function, although we remark that similar types of results to (2.15) exist also in $L_p$ (see e.g. [27, 29]).
Finally, our analysis is non-asymptotic.
The proof of Theorem 2.2 combines the estimation guarantee of Theorem 2.1 with a new analysis of policy iteration for LQR, which we believe is of independent interest. Our new policy iteration analysis combines the work of Bertsekas [7] on policy iteration in infinite horizon average cost MDPs with the contraction theory of Lee and Lim [22] for non-linear matrix equations.

2.3 LSPI for Adaptive LQR

We now turn our attention to the online, adaptive LQR problem as studied in Abbasi-Yadkori and Szepesvári [1]. In the adaptive LQR problem, the quantity of interest is the regret, defined as:

$\mathrm{Regret}(T) := \sum_{t=1}^{T} x_t^\mathsf{T} S x_t + u_t^\mathsf{T} R u_t - T \cdot J_\star \,.$  (2.16)

Here, the algorithm is penalized for the cost incurred from learning the optimal policy $K_\star$, and must balance exploration (to better learn the optimal policy) versus exploitation (to reduce cost). As mentioned previously, there are several known algorithms which achieve $\widetilde{O}(\sqrt{T})$ regret [1, 4, 11, 26, 31]. However, these algorithms operate in a model-based manner, using the collected data to build a confidence interval around the true dynamics (A, B). On the other hand, the performance of adaptive algorithms which are model-free is less well understood. We use the results of the previous section to give an adaptive model-free algorithm for LQR which achieves $\widetilde{O}(T^{2/3})$ regret, which improves upon the $\widetilde{O}(T^{2/3+\varepsilon})$ regret (for $T \geq C^{1/\varepsilon}$) achieved by the adaptive model-free algorithm of Abbasi-Yadkori et al. [3].
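The regret (2.16) and the doubling-epoch schedule used by Algorithm 3 below are both one-liners; a sketch (illustrative only):

```python
import numpy as np

def regret(xs, us, S, R, J_star):
    """Regret (2.16): accumulated cost minus T times the optimal average cost."""
    T = len(us)
    cost = sum(x @ S @ x + u @ R @ u for x, u in zip(xs[:T], us))
    return cost - T * J_star

def epoch_schedule(T_mult, E, sigma_w=1.0):
    """Epoch lengths and exploration variances of Algorithm 3:
    T_i = T_mult * 2^i and sigma_{eta,i}^2 = sigma_w^2 * (1/2^i)^(1/3)."""
    return [(T_mult * 2 ** i, sigma_w ** 2 * (0.5 ** i) ** (1.0 / 3.0))
            for i in range(E)]
```

The decaying exploration variance trades off Q-function estimation error, which scales like $1/(\sigma_\eta^2 \sqrt{T_i})$ per (2.9), against the extra cost of the injected noise, which scales like $\sigma_{\eta,i}^2$; balancing the two over doubling epochs is what yields the $T^{2/3}$ rate.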
Our adaptive algorithm based on LSPI is shown in Algorithm 3.

Algorithm 3 Online Adaptive Model-free LQR Algorithm
Input: Initial stabilizing controller $K^{(0)}$, number of epochs E, epoch multiplier $T_{\mathrm{mult}}$, lower eigenvalue bound $\mu$.
1: for i = 0, ..., E − 1 do
2:   Set $T_i = T_{\mathrm{mult}} 2^i$.
3:   Set $\sigma_{\eta,i}^2 = \sigma_w^2 \left( \frac{1}{2^i} \right)^{1/3}$.
4:   Set $K^{(i+1)} = \mathrm{LSPIv2}(K_0{=}K^{(i)}, N{=}\widetilde{O}((i+1)\Gamma_\star/\mu), T{=}T_i, \sigma_\eta^2{=}\sigma_{\eta,i}^2)$.
5: end for

Using an analysis technique similar to that in Dean et al. [13], we prove the following $\widetilde{O}(T^{2/3})$ regret bound for Algorithm 3.
Theorem 2.3. Fix a $\delta \in (0, 1)$. Let the initial feedback $K^{(0)}$ stabilize (A, B) and let $V^{(0)}$ denote its associated value function. Also let $K_\star$ denote the optimal LQR controller and let $V_\star$ denote the optimal value function. Let $\Gamma_\star = 1 + \max\{\|A\|, \|B\|, \|V^{(0)}\|, \|V_\star\|, \|K^{(0)}\|, \|K_\star\|, \|Q\|, \|R\|\}$. Suppose that $T_{\mathrm{mult}}$ is set to:

$T_{\mathrm{mult}} \geq \widetilde{O}(1) \, \mathrm{poly}(\Gamma_\star, n, d, 1/\lambda_{\min}(S)) \,,$

and suppose $\mu$ is set to $\mu = \min\{\lambda_{\min}(S), \lambda_{\min}(R)\}$. With probability at least $1 - \delta$, we have that the regret of Algorithm 3 satisfies:

$\mathrm{Regret}(T) \leq \widetilde{O}(1) \, \mathrm{poly}(\Gamma_\star, n, d, 1/\lambda_{\min}(S)) \, T^{2/3} \,.$

We note that the regret scaling as $T^{2/3}$ in Theorem 2.3 is due to the $1/\sigma_\eta^4$ dependence from LSTD-Q (c.f. (2.9)). As mentioned previously, the existing LSTD-Q analysis from Abbasi-Yadkori et al.
[3] yields a $1/\sigma_\eta^8$ dependence in LSTD-Q; using this $1/\sigma_\eta^8$ dependence in the analysis of Algorithm 3 would translate into $T^{4/5}$ regret.

3 Related Work

For model-based methods, in the offline setting Fiechter [17] provided the first PAC-learning bound for infinite horizon discounted LQR using certainty equivalence (nominal) control. Later, Dean et al. [12] use tools from robust control to analyze a robust synthesis method for infinite horizon average cost LQR, which is applicable in regimes of moderate uncertainty when nominal control fails. Mania et al. [26] show that certainty equivalence control actually provides a fast $O(\varepsilon^2)$ rate of sub-optimality where $\varepsilon$ is the size of the parameter error, unlike the $O(\varepsilon)$ sub-optimality guarantee of [12, 17]. For the online adaptive setting, [1, 4, 11, 18, 26] give $\widetilde{O}(\sqrt{T})$ regret algorithms. A key component of model-based algorithms is being able to quantify a confidence interval for the parameter estimate, for which several recent works [14, 32, 33] provide non-asymptotic results.
Turning to model-free methods, Tu and Recht [34] study the behavior of least-squares temporal difference (LSTD) for learning the discounted value function associated to a stabilizing policy. They evaluate the LSPI algorithm studied in this paper empirically, but do not provide any analysis. In terms of policy optimization, most of the work has focused on derivative-free random search methods [16, 24]. Tu and Recht [35] study a special family of LQR instances and characterize the asymptotic behavior of both model-based certainty equivalent control versus policy gradients (REINFORCE), showing that policy gradients has polynomially worse sample complexity. Most related to our work is Abbasi-Yadkori et al. [3], who analyze a model-free algorithm for adaptive LQR based on ideas from online convex optimization.
LSTD-Q is a sub-routine of their algorithm, and their analysis incurs a sub-optimal $1/\sigma_\eta^8$ dependence on the injected exploration noise, which we improve to $1/\sigma_\eta^4$ using self-normalized martingale inequalities [2]. This improvement allows us to use a simple greedy exploration strategy to obtain $T^{2/3}$ regret. Finally, as mentioned earlier, the Ph.D. thesis of Bradtke [10] presents an asymptotic consistency argument for approximate PI for discounted LQR in the noiseless setting (i.e. $w_t = 0$ for all t).
For the general function approximation setting in RL, Antos et al. [5] and Lazaric et al. [21] analyze variants of LSPI for discounted MDPs where the state space is compact and the action space finite. In Lazaric et al. [21], the policy is greedily updated via an update operator that requires access to the underlying dynamics (and is therefore not implementable). Farahmand et al. [15] extend the results of Lazaric et al. [21] to when the function spaces considered are reproducing kernel Hilbert spaces. Zou et al. [37] give a finite-time analysis of both Q-learning and SARSA, combining the asymptotic analysis of Melo et al. [28] with the finite-time analysis of TD-learning from Bhandari et al. [8]. We note that checking the required assumptions to apply the results of Zou et al. [37] is non-trivial (c.f. Section 3.1, [28]). We are unaware of any non-asymptotic analysis of LSPI in the average cost setting, which is more difficult as the Bellman operator is no longer a contraction.
Finally, we remark that our LSPI analysis relies on understanding exact policy iteration for LQR, which is closely related to the fixed-point Riccati recurrence (value iteration). An elegant analysis for value iteration is given by Lincoln and Rantzer [23]. Recently, Fazel et al. [16] show that exact policy iteration is a special case of Gauss-Newton and prove linear convergence results.
Our analysis, on the other hand, is based on combining the fixed-point theory from Lee and Lim [22] with recent work on policy iteration for average cost problems from Bertsekas [7].

4 Experiments

We first look at the performance of LSPI in the non-adaptive setting (Section 2.2). Here, we compare LSPI to other popular model-free methods, and the model-based certainty equivalence (nominal) controller (c.f. [26]). For model-free, we look at policy gradients (REINFORCE) (c.f. [36]) and derivative-free optimization (c.f. [24, 25, 30]). A full description of the methods we compare to is given in the full paper [19].
We consider the LQR instance (A, B, S, R) with

$A = \begin{bmatrix} 0.95 & 0.01 & 0 \\ 0.01 & 0.95 & 0.01 \\ 0 & 0.01 & 0.95 \end{bmatrix} \,, \quad B = \begin{bmatrix} 1 & 0.1 \\ 0 & 0.1 \\ 0 & 0.1 \end{bmatrix} \,, \quad S = I_3 \,, \quad R = I_2 \,.$

We choose an LQR problem where the A matrix is stable, since the model-free methods we consider need to be seeded with an initial stabilizing controller; using a stable A allows us to start at $K_0 = 0_{2 \times 3}$. We fix the process noise $\sigma_w = 1$. The model-based nominal method learns (A, B) using least-squares, exciting the system with Gaussian inputs $u_t$ with variance $\sigma_u = 1$.

(a) Offline Evaluation  (b) Adaptive Evaluation

Figure 1: The performance of various model-free methods compared with the nominal controller. (a) Plot of non-adaptive performance. The shaded regions represent the lower 10th and upper 90th percentile over 100 trials, and the solid line represents the median performance. Here, PG (simple) is policy gradients with the simple baseline, PG (vf) is policy gradients with the value function baseline, LSPIv2 is Algorithm 2, LSPIv1 is Algorithm 1, and DFO is derivative-free optimization. (b) Plot of adaptive performance. The shaded regions represent the median to upper 90th percentile over 100 trials.
Here, LSPI is Algorithm 3 using LSPIv1, MFLQ\nis from Abbasi-Yadkori et al. [3], nominal is the \u03b5-greedy adaptive certainty equivalent controller (c.f. [13]), and\noptimal has access to the true dynamics.\n\nFor policy gradients and derivative-free optimization, we use the projected stochastic gradient\ndescent (SGD) method with a constant step size \u00b5 as the optimization procedure. For policy\niteration, we evaluate both LSPIv1 (Algorithm 1) and LSPIv2 (Algorithm 2). For every iteration of\nLSTD-Q, we project the resulting Q-function parameter matrix onto the set {Q : Q (cid:23) \u03b3I} with\n\u03b3 = min{\u03bbmin(S), \u03bbmin(R)}. For LSPIv1, we choose N = 15 by picking the N \u2208 [5, 10, 15] which\nresults in the best performance after T = 106 timesteps. For LSPIv2, we set (N, T ) = (3, 333333)\nwhich yields the lowest cost over the grid N \u2208 [1, 2, 3, 4, 5, 6, 7] and T such that N T = 106.\nNext, we compare the performance of LSPI in the adaptive setting (Section 2.3). We compare\nLSPI against the model-free linear quadratic control (MFLQ) algorithm of Abbasi-Yadkori et al. [3],\nnominal certainty equivalence controller (c.f. [13]), and the optimal controller. We use the example\n\nof Dean et al. [12], with A =\n\n, B = I, S = 10I3, R = I3, and \u03c3w = 1.\n\nFigure 1 shows the results of these experiments. In Figure 1a, we plot the relative error (J((cid:98)K) \u2212\n\nJ(cid:63))/J(cid:63) versus the number of timesteps. We see that the model-based certainty equivalence (nominal)\nmethod is more sample ef\ufb01cient than the other model-free methods considered. We also see that the\nvalue function baseline is able to dramatically reduce the variance of the policy gradient estimator\ncompared to the simple baseline. The DFO method performs the best out of all the model-free\nmethods considered on this example after 106 timesteps, although the performance of policy iteration\nis comparable. In Figure 1b, we plot the regret (c.f. 
Equation 2.16). We see that LSPI and MFLQ both perform similarly, with MFLQ slightly outperforming LSPI. We also note that the model-based nominal method performs significantly better than both LSPI and MFLQ.

5 Conclusion

We studied the sample complexity of approximate PI on LQR, showing that roughly (n + d)^3 ε^{-2} log(1/ε) samples are sufficient to estimate a controller that is within ε of the optimal. We also show how to turn this offline method into an adaptive LQR method with T^{2/3} regret. Several questions remain open with our work. The first is whether policy iteration is able to achieve T^{1/2} regret, which is possible with other model-based methods. The second is whether or not model-free methods provide advantages in situations of partial observability for LQ control. Finally, an asymptotic analysis of LSPI, in the spirit of Tu and Recht [35], is of interest in order to clarify which parts of our analysis are sub-optimal due to the techniques we use and which are inherent in the algorithm.

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. We also thank the authors of Abbasi-Yadkori et al. [3] for providing us with an implementation of their model-free LQ algorithm. ST is supported by a Google PhD fellowship. This work is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, and an Amazon AWS AI Research Award.

References

[1] Y. Abbasi-Yadkori and C. Szepesvári. 
Regret Bounds for the Adaptive Control of Linear Quadratic Systems. In Conference on Learning Theory, 2011.

[2] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems. In Conference on Learning Theory, 2011.

[3] Y. Abbasi-Yadkori, N. Lazić, and C. Szepesvári. Model-Free Linear Quadratic Control via Reduction to Expert Prediction. In AISTATS, 2019.

[4] M. Abeille and A. Lazaric. Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems. In International Conference on Machine Learning, 2018.

[5] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

[6] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. 2007.

[7] D. P. Bertsekas. Value and Policy Iterations in Optimal Control and Adaptive Dynamic Programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3):500–509, 2017.

[8] J. Bhandari, D. Russo, and R. Singal. A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation. In Conference on Learning Theory, 2018.

[9] J. Boyan. Least-Squares Temporal Difference Learning. In International Conference on Machine Learning, 1999.

[10] S. J. Bradtke. Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts Amherst, 1994.

[11] A. Cohen, T. Koren, and Y. Mansour. Learning Linear-Quadratic Regulators Efficiently with only √T Regret. arXiv:1902.06223, 2019.

[12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the Sample Complexity of the Linear Quadratic Regulator. arXiv:1710.01688, 2017.

[13] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. 
Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator. In Neural Information Processing Systems, 2018.

[14] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis. Finite Time Identification in Unstable Linear Systems. Automatica, 96:342–353, 2018.

[15] A. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized Policy Iteration with Nonparametric Function Spaces. Journal of Machine Learning Research, 17(139):1–66, 2016.

[16] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.

[17] C.-N. Fiechter. PAC Adaptive Control of Linear Systems. In Conference on Learning Theory, 1997.

[18] M. Ibrahimi, A. Javanmard, and B. V. Roy. Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems. In Neural Information Processing Systems, 2012.

[19] K. Krauth, S. Tu, and B. Recht. Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator. arXiv:1905.12842, 2019.

[20] M. G. Lagoudakis and R. Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

[21] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-Sample Analysis of Least-Squares Policy Iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.

[22] H. Lee and Y. Lim. Invariant metrics, contractions and nonlinear matrix equations. Nonlinearity, 21(4):857–878, 2008.

[23] B. Lincoln and A. Rantzer. Relaxed Dynamic Programming. IEEE Transactions on Automatic Control, 51(8):1249–1260, 2006.

[24] D. Malik, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright. Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems. In AISTATS, 2019.

[25] H. Mania, A. Guy, and B. Recht. 
Simple random search provides a competitive approach to reinforcement learning. In Neural Information Processing Systems, 2018.

[26] H. Mania, S. Tu, and B. Recht. Certainty Equivalent Control of LQR is Efficient. arXiv:1902.07826, 2019.

[27] A. massoud Farahmand, R. Munos, and C. Szepesvári. Error Propagation for Approximate Policy and Value Iteration. In Neural Information Processing Systems, 2010.

[28] F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An Analysis of Reinforcement Learning with Function Approximation. In International Conference on Machine Learning, 2008.

[29] R. Munos. Error Bounds for Approximate Policy Iteration. In International Conference on Machine Learning, 2003.

[30] Y. Nesterov and V. Spokoiny. Random Gradient-Free Minimization of Convex Functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

[31] Y. Ouyang, M. Gagrani, and R. Jain. Control of unknown linear systems with Thompson sampling. In 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.

[32] T. Sarkar and A. Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, 2019.

[33] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification. In Conference on Learning Theory, 2018.

[34] S. Tu and B. Recht. Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.

[35] S. Tu and B. Recht. The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint. In Conference on Learning Theory, 2019.

[36] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–246, 1992.

[37] S. Zou, T. Xu, and Y. Liang. Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation. arXiv:1902.02234, 2019.
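A note on reproducing the evaluation metric of Section 4: for any stabilizing gain K, the average cost of the policy u = Kx is σ_w² tr(P_K), where P_K solves the discrete Lyapunov equation P = (A + BK)ᵀP(A + BK) + S + KᵀRK, and the optimal cost J⋆ comes from the Riccati equation. The sketch below, which is our illustration rather than the paper's code (the helper name lqr_cost is ours), evaluates the relative error (J(K̂) − J⋆)/J⋆ plotted in Figure 1a on the stable instance from Section 4.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Stable LQR instance from Section 4, so K0 = 0 is stabilizing.
A = np.array([[0.95, 0.01, 0.00],
              [0.01, 0.95, 0.01],
              [0.00, 0.01, 0.95]])
B = np.array([[1.0, 0.1],
              [0.0, 0.1],
              [0.0, 0.1]])
S, R = np.eye(3), np.eye(2)
sigma_w = 1.0  # process noise standard deviation

def lqr_cost(K):
    """Average cost of u = K x: J(K) = sigma_w^2 * tr(P_K), where
    P_K solves P = (A + BK)' P (A + BK) + S + K' R K."""
    L = A + B @ K
    # solve_discrete_lyapunov(a, q) returns x with x = a x a' + q,
    # hence we pass a = L' to match the equation above.
    P = solve_discrete_lyapunov(L.T, S + K.T @ R @ K)
    return sigma_w ** 2 * np.trace(P)

# Optimal cost J* = sigma_w^2 * tr(P*), with P* from the Riccati equation.
P_star = solve_discrete_are(A, B, S, R)
J_star = sigma_w ** 2 * np.trace(P_star)

# Relative error of the initial controller K0 = 0 (the Figure 1a metric).
K0 = np.zeros((2, 3))
rel_err = (lqr_cost(K0) - J_star) / J_star
```

As a sanity check, plugging in the optimal gain K⋆ = −(R + BᵀP⋆B)⁻¹BᵀP⋆A should recover J⋆ up to numerical precision, and the relative error of any suboptimal gain is strictly positive.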