{"title": "Reinforcement Learning Applied to Linear Quadratic Regulation", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 302, "abstract": null, "full_text": "Reinforcement Learning Applied to Linear Quadratic Regulation \n\nSteven J. Bradtke \nComputer Science Department \nUniversity of Massachusetts \nAmherst, MA 01003 \nbradtke@cs.umass.edu \n\nAbstract \n\nRecent research on reinforcement learning has focused on algorithms based on the principles of Dynamic Programming (DP). One of the most promising areas of application for these algorithms is the control of dynamical systems, and some impressive results have been achieved. However, there are significant gaps between practice and theory. In particular, there are no convergence proofs for problems with continuous state and action spaces, or for systems involving non-linear function approximators (such as multilayer perceptrons). This paper presents research applying DP-based reinforcement learning theory to Linear Quadratic Regulation (LQR), an important class of control problems involving continuous state and action spaces and requiring a simple type of non-linear function approximator. We describe an algorithm based on Q-learning that is proven to converge to the optimal controller for a large class of LQR problems. We also describe a slightly different algorithm that is only locally convergent to the optimal Q-function, demonstrating one of the possible pitfalls of using a non-linear function approximator with DP-based learning. \n\n1 INTRODUCTION \n\nRecent research on reinforcement learning has focused on algorithms based on the principles of Dynamic Programming. 
Some of the DP-based reinforcement learning algorithms that have been described are Sutton's Temporal Differences methods (Sutton, 1988), Watkins' Q-learning (Watkins, 1989), and Werbos' Heuristic Dynamic Programming (Werbos, 1987). However, there are few convergence results for DP-based reinforcement learning algorithms, and these are limited to discrete-time, finite-state systems with either lookup tables or linear function approximators. Watkins and Dayan (1992) show that the Q-learning algorithm converges, under appropriate conditions, to the optimal Q-function for finite-state Markovian decision tasks, where the Q-function is represented by a lookup table. Sutton (1988) and Dayan (1992) show that the linear TD(λ) learning rule, when applied to Markovian decision tasks where the states are represented by a linearly independent set of feature vectors, converges in the mean to V_U, the value function for a given control policy U. Dayan (1992) also shows that linear TD(λ) with linearly dependent state representations converges, but not to V_U, the function that the algorithm is supposed to learn. \n\nDespite the paucity of theoretical results, applications have shown promise. For example, Tesauro (1992) describes a system using TD(λ) that learns to play championship-level backgammon entirely through self-play¹. It uses a multilayer perceptron (MLP) trained using backpropagation as a function approximator. Sofge and White (1990) describe a system that learns to improve process control with continuous state and action spaces. Neither of these applications, nor many similar applications that have been described, meets the convergence requirements of the existing theory. Yet they produce good results experimentally. 
We need to extend the theory of DP-based reinforcement learning to domains with continuous state and action spaces, and to algorithms that use non-linear function approximators. \n\nLinear Quadratic Regulation (e.g., Bertsekas, 1987) is a good candidate as a first attempt at extending the theory of DP-based reinforcement learning in this manner. LQR is an important class of control problems and has a well-developed theory. LQR problems involve continuous state and action spaces, and their value functions can be exactly represented by quadratic functions. The following sections review the basics of LQR theory that will be needed in this paper, describe Q-functions for LQR, describe the Q-learning algorithm used in this paper, and describe an algorithm based on Q-learning that is proven to converge to the optimal controller for a large class of LQR problems. We also describe a slightly different algorithm that is only locally convergent to the optimal Q-function, demonstrating one of the possible pitfalls of using a non-linear function approximator with DP-based learning. \n\n¹Backgammon can be viewed as a Markovian decision task. \n\n2 LINEAR QUADRATIC REGULATION \n\nConsider the deterministic, linear, time-invariant, discrete-time dynamical system given by \n\nx_{t+1} = f(x_t, u_t) = A x_t + B u_t, \nu_t = U x_t, \n\nwhere A, B, and U are matrices of dimensions n x n, n x m, and m x n respectively. x_t is the state of the system at time t, and u_t is the control input to the system at time t. U is a linear feedback controller. The cost at every time step is a quadratic function of the state and the control signal: \n\nr_t = r(x_t, u_t) = x_t' E x_t + u_t' F u_t, \n\nwhere E and F are symmetric, positive definite matrices of dimensions n x n and m x m respectively, and x' denotes x transpose. 
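To make the dynamics and cost concrete, the following minimal sketch simulates the closed-loop system x_{t+1} = A x_t + B u_t with u_t = U x_t and accumulates the quadratic cost r_t = x_t' E x_t + u_t' F u_t. The particular matrices A, B, E, F, and U are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical 2-state, 1-input LQR instance (illustrative values only).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
E = np.eye(2)                  # state cost weight, symmetric positive definite
F = np.eye(1)                  # control cost weight, symmetric positive definite
U = np.array([[-1.0, -2.0]])   # linear feedback controller: u_t = U x_t

def step(x):
    """One step of the closed loop: returns (x_{t+1}, r_t)."""
    u = U @ x
    r = float(x @ E @ x + u @ F @ u)   # quadratic cost x'Ex + u'Fu
    x_next = A @ x + B @ u             # linear dynamics A x + B u
    return x_next, r

# Roll the system forward a few steps and accumulate cost.
x = np.array([1.0, 0.0])
total = 0.0
for _ in range(5):
    x, r = step(x)
    total += r
```

Since E and F are positive definite, every step from a nonzero state incurs a strictly positive cost, which is why the value function defined next is a sum of positive terms.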
\n\nThe value V_U(x_t) of a state x_t under a given control policy U is defined as the discounted sum of all costs that will be incurred by using U for all times from t onward, i.e., V_U(x_t) = Σ_{i=0}^∞ γ^i r_{t+i}, where 0 ≤ γ ≤ 1 is the discount factor. Linear-quadratic control theory (e.g., Bertsekas, 1987) tells us that V_U is a quadratic function of the state and can be expressed as V_U(x_t) = x_t' K_U x_t, where K_U is the n x n cost matrix for policy U. The optimal control policy, U*, is that policy for which the value of every state is minimized. We denote the cost matrix for the optimal policy by K*. \n\n3 Q-FUNCTIONS FOR LQR \n\nWatkins (1989) defined the Q-function for a given control policy U as Q_U(x, u) = r(x, u) + γ V_U(f(x, u)). This can be expressed for an LQR problem as \n\nQ_U(x, u) = r(x, u) + γ V_U(f(x, u)) \n          = x' E x + u' F u + γ (A x + B u)' K_U (A x + B u) \n          = [x, u]' [ E + γ A' K_U A    γ A' K_U B     ] [x, u],     (1) \n                    [ γ B' K_U A        F + γ B' K_U B ] \n\nwhere [x, u] is the column vector concatenation of the column vectors x and u. \n\nDefine the parameter matrix H_U as \n\nH_U = [ E + γ A' K_U A    γ A' K_U B     ].     (2) \n      [ γ B' K_U A        F + γ B' K_U B ] \n\nH_U is a symmetric positive definite matrix of dimensions (n + m) x (n + m). \n\n4 Q-LEARNING FOR LQR \n\nThe convergence results for Q-learning (Watkins & Dayan, 1992) assume a discrete-time, finite-state system, and require the use of lookup tables to represent the Q-function. This is not suitable for the LQR domain, where the states and actions are vectors of real numbers. Following the work of others, we will use a parameterized representation of the Q-function and adjust the parameters through a learning process. For example, Jordan and Jacobs (1990) and Lin (1992) use MLPs trained using backpropagation to approximate the Q-function. 
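The identity Q_U(x, u) = [x, u]' H_U [x, u] from equations (1) and (2) can be checked numerically. The sketch below, using assumed system matrices, first obtains K_U as the fixed point of the policy-evaluation identity K_U = E + U'FU + γ(A + BU)' K_U (A + BU) (which follows from V_U(x) = r(x, Ux) + γ V_U((A + BU)x); the naive fixed-point iteration is our illustrative choice, not the paper's method), then assembles H_U.

```python
import numpy as np

# Illustrative LQR instance (assumed values, not from the paper).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
E = np.eye(2)
F = np.eye(1)
U = np.array([[-0.2, -0.3]])   # a fixed, stabilizing linear policy
gamma = 0.9

# Policy evaluation: iterate K <- E + U'FU + gamma (A+BU)' K (A+BU)
# to its fixed point, the cost matrix K_U of policy U.
Acl = A + B @ U
K = np.zeros((2, 2))
for _ in range(2000):
    K = E + U.T @ F @ U + gamma * Acl.T @ K @ Acl

# Parameter matrix H_U from equation (2).
H = np.block([[E + gamma * A.T @ K @ A, gamma * A.T @ K @ B],
              [gamma * B.T @ K @ A,     F + gamma * B.T @ K @ B]])

def Q(x, u):
    """Q_U(x, u) = [x, u]' H_U [x, u], per equation (1)."""
    z = np.concatenate([x, u])
    return float(z @ H @ z)
```

Expanding z'Hz term by term reproduces x'Ex + u'Fu + γ(Ax + Bu)' K_U (Ax + Bu), so the one-step Bellman identity Q_U(x, u) = r(x, u) + γ V_U(f(x, u)) holds for this H by construction.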
Notice that the function Q_U is a quadratic function of its arguments, the state and control action, but it is a linear function of the quadratic combinations of the entries of the vector [x, u]. For example, if x = [x_1, x_2] and u = [u_1], then Q_U(x, u) is a linear function of the vector [x_1², x_2², u_1², x_1 x_2, x_1 u_1, x_2 u_1]. This fact allows us to use linear Recursive Least Squares (RLS) to implement Q-learning in the LQR domain. \n\nThere are two forms of Q-learning. The first is the rule Watkins described in his thesis (Watkins, 1989). Watkins called this rule Q-learning, but we will refer to it as optimizing Q-learning because it attempts to learn the Q-function of the optimal policy directly. The optimizing Q-learning rule may be written as \n\nQ_{t+1}(x_t, u_t) = Q_t(x_t, u_t) + α [ r(x_t, u_t) + γ min_a Q_t(x_{t+1}, a) - Q_t(x_t, u_t) ],     (3) \n\nwhere Q_t is the t-th approximation to Q*. The second form of Q-learning attempts to learn Q_U, the Q-function for some designated policy U. U may or may not be the policy that is actually followed during training. This policy-based Q-learning rule may be written as \n\nQ_{t+1}(x_t, u_t) = Q_t(x_t, u_t) + α [ r(x_t, u_t) + γ Q_t(x_{t+1}, U x_{t+1}) - Q_t(x_t, u_t) ],     (4) \n\nwhere Q_t is the t-th approximation to Q_U. Bradtke, Ydstie, and Barto (paper in preparation) show that a linear RLS implementation of the policy-based Q-learning rule will converge to Q_U for LQR problems. \n\n5 POLICY IMPROVEMENT FOR LQR \n\nGiven a policy U_k, how can we find an improved policy, U_{k+1}? Following Howard (1960), define U_{k+1} as \n\nU_{k+1} x = argmin_u [ r(x, u) + γ V_{U_k}(f(x, u)) ]. \n\nBut equation (1) tells us that this can be rewritten as \n\nU_{k+1} x = argmin_u Q_{U_k}(x, u). \n\nWe can find the minimizing u by taking the partial derivative of Q_{U_k}(x, u) with respect to u, setting that to zero, and solving for u. This yields \n\nu = -γ (F + γ B' K_{U_k} B)^{-1} B' K_{U_k} A x.
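The policy-improvement step can be sketched end to end: evaluate K_{U_k}, then set U_{k+1} from the closed-form minimizer u = -γ (F + γ B' K_{U_k} B)^{-1} B' K_{U_k} A x, and repeat. This is an illustration under assumed matrices, using naive fixed-point policy evaluation in place of the RLS-based Q-learning the paper develops.

```python
import numpy as np

# Assumed system and cost matrices (illustrative, not from the paper).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
E = np.eye(2)
F = np.eye(1)
gamma = 0.9
n, m = 2, 1

def K_of(U):
    """Policy evaluation: fixed point of K = E + U'FU + gamma (A+BU)'K(A+BU)."""
    Acl = A + B @ U
    K = np.zeros((n, n))
    for _ in range(2000):
        K = E + U.T @ F @ U + gamma * Acl.T @ K @ Acl
    return K

def improve(U):
    """One Howard improvement step: the u minimizing Q_{U_k}(x, u) is linear
    in x with gain -gamma (F + gamma B'KB)^{-1} B'KA."""
    K = K_of(U)
    return -gamma * np.linalg.solve(F + gamma * B.T @ K @ B,
                                    B.T @ K @ A)

# Alternate evaluation and improvement until the gain stops changing.
U = np.zeros((m, n))
for _ in range(20):
    U = improve(U)
```

At convergence the gain is a fixed point of the improvement map, i.e., improve(U) returns U itself; that fixed point is the optimal feedback controller for this discounted LQR instance.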